SHAPE—an approach for self-healing and self-protection in complex distributed networks


J Supercomput (2014) 67:585–613
DOI 10.1007/s11227-013-1019-3


Inderpreet Chopra · Maninder Singh

I. Chopra (B)
Thapar University, Patiala, India
e-mail: [email protected]

M. Singh
CSED, Thapar University, Patiala, India
e-mail: [email protected]

Published online: 24 September 2013
© Springer Science+Business Media New York 2013

Abstract The increasing complexity of large-scale distributed systems is creating problems in managing faults and security attacks because management is still largely manual. This paper proposes a novel approach called SHAPE to self-heal and self-protect the system from various kinds of faults and security attacks. It deals with hardware, software, and network faults and provides security against DDoS, R2L, U2R, and probing attacks. SHAPE is implemented and evaluated against various standard metrics. The results are provided to support the approach.

Keywords Grid computing · Cloud computing · Security · Fault tolerance

1 Introduction

A distributed system provides a seamless integration of computing functions between different computers to obtain large-scale resource sharing at affordable cost. A growing appetite for computational power in scientific research and computational discovery has resulted in conditions that are favorable for the evolution of complex distributed systems. Complex distributed systems are those that involve a large number of heterogeneous resources that keep changing their state. This often increases the probability that resources are compromised or fail, compared to traditional distributed systems.

This paper targets two of the most complex and widely used distributed networks: grid computing [1] and cloud computing [2]. Grid computing emerged in the early 1990s, as high performance computers were interconnected via fast data communication links, with the aim of supporting complex calculations and data-intensive scientific applications. Grid computing connects a wide variety of heterogeneous resources, including computers and computing resources such as printers, desktops, laptops, databases, and storage area networks, to create vast virtual reservoirs of computers serving geographically widely separated users. In contrast, cloud computing has resulted from the convergence of grid computing, utility computing and Software as a Service (SaaS) [3], and essentially represents the increasing trend toward the external deployment of IT resources, such as computational power, storage or business applications, and obtaining them as services.

As a large number of dynamic heterogeneous resources are involved in setting up grid and cloud frameworks, failures and security concerns increase [4]. The skilled persons who manage these systems are expensive, and it becomes difficult for them to manually manage configuration, healing, optimization, protection, and maintenance when resources keep varying [5]. A major concern, therefore, is managing resources automatically. This can be achieved by using the concept of Autonomic Computing given by IBM [6].

Autonomic computing [7] provides a self-* concept to address issues such as self-configuration, self-healing, self-protection, and self-optimization.

– self-configuring: the ability to readjust itself on the fly
– self-healing: discover, diagnose, and react to disruptions
– self-optimization: maximize resource utilization to meet end-user needs
– self-protection: anticipate, detect, identify, and protect itself from attacks

The focus of this paper is on the self-healing and self-protection of grid and cloud systems. Managing these systems is a complex task, as they involve computing nodes that can join and leave the system at any time without any dedicated commitment.

SHAPE, the self-healing and protection environment model presented in this paper, offers an automated way of handling failures and provides protection from various kinds of security attacks. Self-healing enables the system to detect and recover from potential problems and continue to function smoothly. A self-protecting system is capable of detecting attacks and protecting its resources from both internal and external attacks.

1.1 Major contributions

Contributions of the research reported in this paper are:

1. The important problem of automating distributed system management, with respect to recovery from different kinds of faults as well as from a number of security attacks, is discussed.

2. A novel approach called SHAPE is presented that provides a platform for adding self-healing and self-protection capabilities to any distributed system. It is designed using a component-based architecture in which components can easily be added or removed.

3. It is implemented purely using open source technologies.


4. It has the built-in capability to handle network, software, and hardware related faults. It also hardens the system so as to reduce the frequency of fault occurrence.

5. It provides automatic generation of signatures against four kinds of network attacks: Distributed Denial of Service (DDoS), Remote to Local (R2L), User to Root (U2R), and probing.

6. It has been evaluated using standard metrics for failures and security attacks in a grid environment. These include validations based on throughput, turnaround time, waiting time, detection rate, and false positive rate. Results show that SHAPE increases the job execution rates by reducing the security attacks and failures in the system.

Section 2 defines self-healing and the kinds of failures that can hinder the performance of complex distributed systems. Section 3 discusses self-protection and common vulnerabilities. Section 4 describes the SHAPE architecture and its working. Implementation and results are presented in Sect. 5. Finally, we end with conclusions and future work in Sect. 6.

2 Self-healing

Self-healing is the ability of systems to heal themselves of system faults and to survive malicious attacks. This is analogous to the manner in which a biological system heals a wound. It enables a system to perceive that it is not functioning properly and make the necessary alterations to regain normative performance levels [8]. A self-healing system must be able to recover from a failed component by detecting and isolating it, fixing it, and reestablishing the fixed or replacement component into service without any apparent overall disruption. A self-healing system must also foresee problems and take the necessary actions to ensure that a failure does not affect the applications [9]. Faults can arise for various reasons. The main causes we identified are as follows:

– Hardware faults: Hardware failures take place due to faulty hardware components such as CPU, memory, and storage devices.

– Software faults: Several resource-intensive applications run on these systems to perform particular tasks. Software failures such as unhandled exceptions or unexpected input can occur while these applications run. The underlying causes can be memory leaks, deadlocks, inefficient resource management, etc.

– Network faults: In complex networks, computing resources are connected over multiple and different types of distributed networks. As a result, physical damage or operational faults in the network are more likely. The network may exhibit significant packet loss or packet corruption. Moreover, individual nodes in the network or the whole network may go down.

2.1 Recent work

There is limited research work done in the area of “self-healing”. For most self-healing systems, fault monitors are an integral part. The main role of a monitor is to check that the system is functioning correctly. The base technique that most monitoring units follow is the heartbeat. Heartbeat is further classified into centralized, ring, and all-to-all heartbeating [10]. A newer approach called application heartbeat [11] is now prevalent. Its goal is to manage the performance of software applications that have been instrumented to emit their performance level via the application heartbeat framework [12]. By making calls to the heartbeat API, applications signal “heartbeats” at important places in the code. Additional functions in the heartbeat interface allow applications to specify their goals in terms of a desired heart rate [13]. The decision-making process should then assign operating system resources to each instrumented application so that the application matches the specified performance level.
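The application heartbeat idea described above can be illustrated with a minimal Python sketch. It is not the API of the heartbeat framework cited in [11–13]; the class `HeartbeatMonitor`, its method names, and the sliding-window rate estimate are assumptions made purely for illustration.

```python
from collections import deque


class HeartbeatMonitor:
    """Hypothetical monitor: tracks beat timestamps against a target rate."""

    def __init__(self, target_rate, window=5):
        self.target_rate = target_rate      # desired beats per second (assumed unit)
        self.beats = deque(maxlen=window)   # most recent beat timestamps

    def beat(self, timestamp):
        """Called by the application at an important point in its code."""
        self.beats.append(timestamp)

    def rate(self):
        """Observed beats per second over the window, or None if unknown."""
        if len(self.beats) < 2:
            return None
        elapsed = self.beats[-1] - self.beats[0]
        if elapsed <= 0:
            return None
        return (len(self.beats) - 1) / elapsed

    def needs_more_resources(self):
        """True when the application runs below its target heart rate."""
        observed = self.rate()
        return observed is not None and observed < self.target_rate
```

A decision-making component could poll `needs_more_resources()` for each instrumented application and adjust the resources assigned to it; how resources are actually reassigned is outside the scope of this sketch.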

With grid and cloud computing systems gaining momentum, integrating scheduling with fault tolerance is another way to handle failures. Most fault-tolerant scheduling algorithms couple scheduling policies with job replication schemes so that jobs are executed efficiently and reliably. One recent development in fault-tolerant scheduling is Adaptive Job Replication (AJR) and Backup Resources Selection (BRS) [14]. It employs job replication as an effective approach for achieving an efficient fault-tolerant scheduling system. Most work using the replication-based approach assumes a fixed number of replicas for each job, which consumes more resources. More replicas imply more resource consumption and higher economic cost.

To address this problem, with the target of satisfying the user’s reliability requirement with minimum resources, the “MaxRe” algorithm was introduced [15]. MaxRe incorporates reliability analysis into the active replication schema and exploits a dynamic number of replicas for different tasks.

SHAPE uses a very different approach for dealing with failures. It uses various agents to protect the system from hardware, software, and network failures. In our literature review, we did not come across any system that handles all three types of failure together, especially hardware failures in a distributed environment. Driver hardening and a machine log analyzer based approach are used to handle hardware failures. For software and network failures, SHAPE uses a modified heartbeat-based model.

3 Self-protection

A self-protecting system helps to detect and identify hostile behavior and takes autonomous actions to protect itself against intrusive behavior. The main goal of a self-protection system is to defend the environment against malicious intentional actions by scanning for suspicious activities and reacting accordingly, without the user being aware that such protection is in progress [16].

The main design principles required to build a self-protected system [9] are summarized below:

1. A self-protected system must be able to detect intrusions. This requires a definition of its own operations: the sense-of-self capacity, or self-knowledge aspect. In other words, it must be able to distinguish legal behaviors from illegal behaviors.

2. The system must have the ability to respond to attacks. Whenever an attack is detected, the system should be able to block the attack or log an alert.

3. The system must prevent the self-protection components from being compromised.

The kinds of vulnerabilities against which we need self-protection are:

DoS attacks: these continue to be the main threat. According to [17], 45 % of surveyed data center operators experienced DDoS attacks against their data centers, up 60 % from the prior year, and of these, 94 % are seeing DDoS attacks regularly. In such attacks, a set of attackers generates huge traffic, saturating the victim’s network and causing significant damage [18]. These include:

– SMURF: In the “smurf” attack, attackers use ICMP echo request packets directed to IP broadcast addresses from remote locations to create a denial-of-service attack.

– LAND: The land attack is a denial-of-service attack that is effective against some older TCP/IP implementations. It occurs when an attacker sends a spoofed SYN packet in which the source address is the same as the destination address.

– SYN Flood (Neptune): A SYN flood is a denial-of-service attack to which every TCP/IP implementation is vulnerable (to some degree). Each half-open TCP connection made to a machine causes the “tcpd” server to add a record to the data structure that stores information describing all pending connections. This data structure is of finite size, and it can be made to overflow by intentionally creating too many partially open connections. The half-open connections data structure on the victim server system will eventually fill, and the system will be unable to accept any new incoming connections until the table is emptied out. Normally there is a timeout associated with a pending connection, so the half-open connections will eventually expire and the victim server system will recover. However, the attacking system can simply continue sending IP-spoofed packets requesting new connections faster than the victim system can expire the pending connections. In some cases, the system may exhaust memory, crash, or be rendered otherwise inoperative.

– Teardrop: The teardrop exploit is a denial-of-service attack that exploits a flaw in the implementation of older TCP/IP stacks. Some implementations of the IP fragmentation reassembly code on these platforms do not properly handle overlapping IP fragments.
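The SYN flood mechanism described in the list above (a finite table of half-open connections with a timeout) can be sketched as a toy simulation. This is not a model of any real TCP stack; the class `SynBacklog`, the capacity of 4 entries, the 10-second timeout, and the example addresses are all assumptions chosen to keep the example small.

```python
from collections import OrderedDict


class SynBacklog:
    """Toy model of a finite table of half-open TCP connections."""

    def __init__(self, capacity, timeout):
        self.capacity = capacity      # maximum pending entries (assumed value)
        self.timeout = timeout        # seconds before a pending entry expires
        self.pending = OrderedDict()  # (src_ip, src_port) -> arrival time

    def expire(self, now):
        """Drop pending entries that are at least `timeout` seconds old."""
        while self.pending:
            key, arrived = next(iter(self.pending.items()))
            if now - arrived < self.timeout:
                break
            del self.pending[key]

    def on_syn(self, src, now):
        """Record a SYN; return False if the table is full and it is dropped."""
        self.expire(now)
        if src in self.pending:
            return True
        if len(self.pending) >= self.capacity:
            return False
        self.pending[src] = now
        return True

    def on_ack(self, src):
        """The handshake completed, so the entry leaves the pending table."""
        self.pending.pop(src, None)


def simulate_flood(capacity=4, timeout=10.0):
    backlog = SynBacklog(capacity, timeout)
    # Spoofed SYNs arrive once per second and never complete the handshake.
    for second in range(capacity):
        backlog.on_syn(("10.0.0.%d" % second, 1000 + second), float(second))
    # A legitimate client is refused while the table is full...
    refused = not backlog.on_syn(("192.0.2.7", 5000), float(capacity))
    # ...and accepted again once the spoofed entries have timed out.
    accepted_later = backlog.on_syn(("192.0.2.7", 5000), capacity + timeout)
    return refused, accepted_later
```

With the default arguments, `simulate_flood()` returns `(True, True)`: the legitimate client is refused while the four spoofed entries fill the table, and accepted after they expire. An attacker who keeps sending spoofed SYNs faster than the timeout would keep the table full.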

Insider attacks: attacks from the inside carry the potential for significant damage that can rival or even exceed the damage caused by external forces. As an integral and trusted member of the organization, the perpetrator carries valid authorization and typically enjoys relatively unchallenged presence and movement within the organization’s IT infrastructure. The attacks typically target specific information and exploit established entry points or obscure vulnerabilities. In many respects, insider attacks can be more difficult to detect than penetration attempts from the outside [19]. Since 2001, over 700 cases of actual insider crimes have been collected and analyzed by CERT researchers. The crimes collected range across multiple sectors, span organizations from small companies to multinational corporations, and cover several hundred types of exploits used by malicious insiders to harm an organization.


Remote to local (R2L): attacks in which an unauthorized user is able to bypass normal authentication and execute commands on the target [20].

– Guess Password
– IMAP
– SPY

User to root (U2R): attacks in which a user with login access is able to bypass normal authentication to gain the privileges of another user, usually root. This can also be the case when the user tries to use some other resources assigned to him [20].

– Buffer Overflow: Buffer overflows occur when a program copies too much data into a static buffer without checking to make sure that the data will fit.

– Rootkits: A rootkit is a stealthy type of software, often malicious, designed to hide the existence of certain processes or programs from normal methods of detection and enable continued privileged access to a computer [22].

Probing: Programs have been distributed that can automatically scan a network of computers to gather information or to find known vulnerabilities [21].

– NMAP: Nmap is a general-purpose tool for performing network scans. Nmap supports many different types of port scans; options include SYN, FIN, and ACK scanning with both TCP and UDP, as well as ICMP (ping) scanning [23]. The Nmap program also allows a user to specify which ports to scan, how much time to wait between each port, and whether the ports should be scanned sequentially or in a random order.

– Ports Sweep: scans multiple hosts for a specific listening port.

3.1 Recent work

Recently, extensive research has focused on finding new approaches for automatically detecting and preventing intrusions in distributed networks. In this context, we found that intrusion detection systems (IDSs) are the best way to keep the network safe. IDSs are used to stop attacks, recover from them with minimum loss, or analyze the security problems so that they are not repeated [24].

IDSs are broadly divided into two categories: signature based and anomaly based. Signature-based IDSs look for patterns in their library of known signatures, but are not effective against novel attacks. Anomaly-based IDSs, on the other hand, analyze abnormal activities and flag them as attacks. Snort [25] is the most commonly used signature-based detector; it runs over IP networks, analyzing real-time traffic for detection of misuse [26]. Snort can also work as an anomaly detection IDS by using its preprocessor component. Many approaches based on these two techniques have been proposed to handle intrusions.
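The difference between the two categories can be made concrete with a small Python sketch. It does not use Snort's rule language or any Snort API; the function names, the byte-pattern signature store, and the three-standard-deviation threshold are assumptions for illustration only.

```python
def signature_match(payload, signatures):
    """Signature-based check: names of known patterns found in the payload."""
    return [name for name, pattern in signatures.items() if pattern in payload]


def is_anomalous(value, baseline_mean, baseline_std, k=3.0):
    """Anomaly-based check: flag values far from a learned baseline.

    The k-sigma rule is an assumed, deliberately simple anomaly criterion.
    """
    return abs(value - baseline_mean) > k * baseline_std
```

A payload that matches no stored pattern passes the signature check unnoticed, which is why signature-based detection misses novel attacks, while the anomaly check can flag it if some measured property (for example, requests per second) departs from the baseline.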

M. Ali Aydin et al. [27] proposed a hybrid IDS that combines the two approaches in one system. The hybrid IDS is obtained by combining packet header anomaly detection (PHAD) and network traffic anomaly detection (NETAD), which are anomaly-based IDSs, with the misuse-based IDS Snort. The results show that the hybrid IDS is more powerful than the signature-based one. The problem we see with this approach is that its performance will degrade when the traffic on the unit running the IDS increases. This is because it installs the IDS on a single unit that serves a single network; for a distributed network, this approach is not suitable. Yu-Xin Ding et al. [28] proposed another Snort-based hybrid IDS. It is divided into three modules: misuse detection, anomaly detection, and signature generation. Snort is used as the misuse detection module to detect known attacks. The anomaly detection module uses the frequent episode rule mining algorithm with a sliding window to generate rules for anomaly detection. Signatures of attacks newly detected by the anomaly detection module are generated by the signature generation module, which uses the a-priori algorithm. It provides good performance in offline detection, but cannot be used for real-time detection.

J. Gomez et al. [29] made another attempt to use the Snort preprocessor capability to design a system called the hybrid IDS, i.e., H-IDS. Its basic statistical method uses moving averages of network traffic. This is a very basic model that uses some data mining techniques to predict the future performance of the system.
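A moving-average traffic check of the kind mentioned above might look like the following sketch. It is not the H-IDS implementation from [29]; the window length, the multiplicative threshold `factor`, and the function name are assumptions.

```python
from collections import deque


def moving_average_alerts(samples, window=5, factor=2.0):
    """Return indices whose value exceeds factor x the mean of the previous window."""
    history = deque(maxlen=window)   # the last `window` traffic samples
    alerts = []
    for index, value in enumerate(samples):
        if len(history) == window:
            baseline = sum(history) / window
            if value > factor * baseline:
                alerts.append(index)
        history.append(value)
    return alerts
```

For the packet counts `[100, 110, 90, 105, 95, 100, 400, 100]`, only index 6 is flagged. Note that the spike itself enters the history and raises the baseline for the following samples, one reason such a simple model is limited.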

Vijay Katkar and S.G. Bhirud [30] proposed a lightweight mechanism to detect novel DoS/DDoS (resource consumption) attacks and an automatic signature generation process to represent them in real time. Condition-based omission of network connection records, used for novel attack signature generation, increases speed and accuracy. The limitation of this technique is that it is restricted to attacks related to resource consumption. No other attacks, such as non-resource-consumption DoS and DDoS, R2L, and U2R attacks, are handled. Other similar research on Snort includes [31–33]. For instance, [31] models only HTTP traffic; [32] models the network traffic as a set of events and looks for abnormalities in these events; [33] enhances the functionality of Snort to automatically generate patterns of misuse from attack data and to detect sequential intrusion behaviors; and [14] is a preprocessor based on studying the defragmentation of packets in the network to avoid evasive attacks on the IDS.

All the techniques discussed use a centralized system. The major drawbacks of such systems are high false positive rates, low efficiency, etc., especially in the case of distributed attacks. Many distributed agent-based techniques have also been developed to address these drawbacks. Imen Brahmi et al. [34] propose a technique called DIDMAS (Distributed Intrusion Detection using Mobile Agents and Snort) that focuses only on the misuse detection approach. The experimental results show the effectiveness of this approach and highlight that DIDMAS realizes the scalability of mobile agent based approaches, as it reduces bandwidth consumption and response time. The main drawback of this approach is that it cannot detect new attacks; it can detect only those attacks that are present in its signature database. A similar approach is DIDS (Distributed Intrusion Detection System) [35], which follows the same path as DIDMAS.

SHAPE presents a security approach that aims to overcome the limitations discussed above. SHAPE has the ability to automatically generate new signatures and gives protection against DDoS, R2L, U2R and probing attacks. Security agents are built on Snort as an anomaly detector. Snort is extended to generate signatures automatically for any new intrusions detected. This reduces the analysis time if the same intrusion is attempted again.


Fig. 1 SHAPE autonomic element

4 SHAPE architecture

SHAPE is the first combined self-healing and self-protection approach for complex distributed systems. SHAPE proposes a model to automatically diagnose problems from observed symptoms; the results of the diagnosis can then be used to trigger automated response and recovery.

In this model, SHAPE autonomic elements (agents), which manage self-healing and self-protection of the network, dynamically organize management work without centralized control and direction. An autonomic element consists of sensors, monitors, an analyzer, a planner, an executor, and an effector (Fig. 1). The details of each component are discussed in the sections below. Each SHAPE element establishes acquaintance relationships to acquire data from the others and stay up to date. Based on this acquaintance information, elements are able to collaborate through their interactions and contribute each member’s capability to accomplish the necessary subtasks when a failure or security breach occurs. Every participant acts according to its capability and knowledge, and sends results, further requests, and information to others for cooperative work.

When these autonomic elements communicate, together they form an autonomic unit (AU). An AU consists of different machines (AEs) working together to handle failures and security breaches. Each AU has a single manager node; the rest act as processing nodes that generate data for the manager. Only managers can communicate with the managers of another AU. Figure 2 shows the interaction between two AUs. To describe the SHAPE components in mathematical terms, we define:

Autonomic Element, AE = {Sensor, Monitor, Analyze, Plan, Execute, Effector},
Autonomic Unit, AU = {AE1, AE2, ..., AEN},
SHAPE = {AU1, AU2, ..., AUN}.

Fig. 2 SHAPE autonomic unit interaction

Fig. 3 SHAPE communication

Communication between the manager and child nodes is both push and pull based. As shown in Fig. 3, the AM (manager node) can ask for the status of an AE, and if the AE fails to respond three times in sequence, it is treated as down. Whenever new updates are available, the manager node pushes them to the child nodes, and it can pull the logs from the child nodes after a specific interval. This functionality is further optimized by enabling the “HTTP MTP [36]” module in JADE, which reuses connections instead of opening new ones each time a message must be delivered to a remote platform.
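The rule that an AE is treated as down after three consecutive unanswered status requests can be sketched as follows. The sketch is plain Python rather than JADE agent code, and the class `StatusTracker`, its method names, and the node label used in the usage note are assumptions.

```python
class StatusTracker:
    """Manager-side bookkeeping for the pull-based status check."""

    MAX_MISSES = 3  # consecutive unanswered polls before a node counts as down

    def __init__(self, nodes):
        self.misses = {node: 0 for node in nodes}

    def record(self, node, replied):
        """Update the miss counter for one poll and return the node's state."""
        if replied:
            self.misses[node] = 0       # any reply resets the sequence
        else:
            self.misses[node] += 1
        return "DOWN" if self.misses[node] >= self.MAX_MISSES else "ACTIVE"
```

For a tracker created with `StatusTracker(["ae1"])`, two missed polls followed by a reply leave `ae1` in the `ACTIVE` state; only three misses in a row produce `DOWN`.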

All the updates done by the manager are stored in a centralized database. The manager node also maintains a DB replica of all the configurations as backup. Whenever the master DB goes down for any reason, the backup DB acts as master until the master DB is up again. Also, by keeping the configurations in a centralized DB, any AE participating in the AU can act as the manager node if the manager fails.

Fig. 4 Fault handling agents (FHA)

For security of agents, we have integrated the JADE-PKI [37] plugin. The JADE-PKI add-on introduces a public key infrastructure into JADE. It provides security for agent messaging and secures the communications between the agents.

4.1 Sensors

Sensors retrieve information about the current state of the other machines. In SHAPE, sensors are used only to transfer the processing nodes' findings to the manager node. At the manager node, they collect all the alert logs from child nodes; at child nodes, they receive updates from the manager node. These updates include new security signatures, actions to restart nodes in case of failures, and other updates regarding the network status.

4.2 Monitors

The monitors’ main task is to monitor the resource nodes for failures (hardware, software, and network) or any suspicious activities. The monitor comprises two types of agents: Fault Handling Agents (FHA) and Security Agents (SA).

4.2.1 Fault handling agents (FHA)

FHA provides the self-healing capability to the system. In SHAPE, we target three kinds of failures: hardware, software and network. To handle these, three separate agents work in coordination with each other.

As shown in Fig. 4, FHA includes the following agents.

Hardware agents The hardware agents keep the environment aware of any hardware failures. They also consider ways to reduce the occurrence of hardware failures and to handle job executions in case of failures.

Hardware hardening agent (HHA) [Algorithm 1] HHA aims to reduce the failure rate caused by hardware failures. Whenever a node is added to the network, this agent checks its device drivers and tries to harden them. The device driver acts as an interface between the hardware and the application in any system. The device and driver interact through a protocol specified by the hardware. When the device obeys the specification, a driver may trust any inputs it receives [39]. Device hardware failures cause system hangs or crashes when drivers cannot detect or tolerate the failure. The Linux kernel mailing list contains numerous reports of drivers waiting forever and reminders from kernel experts to avoid infinite waits [38].

Algorithm 1 Hardware Hardening Agents (HHA)
BEGIN
Let HdN = {N1, N2, N3, ..., Ni} be the set of hardened nodes // MTP implemented using JADE is used for communication
Let NC be the current node added into the system
if NC ∉ HdN then
    START HHA hardening process
    Scan drivers and generate the list of drivers to be hardened, Li
    Create replicas of the original drivers that are going to be hardened
    for all Li do
        Harden driver using carburizer engine
    end for
    ADD NC → HdN
else
    NODE is already hardened
end if
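The control flow of Algorithm 1 can be sketched in Python as follows. The callables `scan_drivers`, `backup_driver`, and `harden_driver` are hypothetical stand-ins for the driver scan, the replica step, and the carburizer engine; the sketch shows only the membership check and ordering of steps, not how drivers are actually hardened.

```python
def harden_node(node, hardened_nodes, scan_drivers, backup_driver, harden_driver):
    """Harden the drivers of a newly added node unless it was hardened before.

    scan_drivers, backup_driver and harden_driver are hypothetical callables
    supplied by the caller; they are not part of any real library.
    """
    if node in hardened_nodes:
        return []                      # NODE is already hardened
    drivers = scan_drivers(node)       # list of drivers to be hardened (Li)
    for driver in drivers:
        backup_driver(node, driver)    # keep a replica of the original driver
    for driver in drivers:
        harden_driver(node, driver)    # harden each driver in turn
    hardened_nodes.add(node)           # ADD NC -> HdN
    return drivers
```

Passing the three steps in as functions keeps the sketch independent of how the agent reaches the node (the text describes an SSH session) and makes the flow easy to exercise with stubs.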

SHAPE uses the concept of carburizer [39] to harden the device drivers. Hardening is the process by which the driver is made to work correctly even when faults occur in the device that it controls, or when other faults originate outside the device. A hardened driver should not hang the system or allow the uncontrolled spread of corrupted data as a result of any such faults. Our implementation differs from carburizer in two ways. First, our system is based on an agent-based architecture. Second, HHA generates reports listing all hardened device drivers and updates the manager node. CIL (C Intermediate Language) is used to build this component. CIL is a high-level representation along with a set of tools that permit easy analysis and source-to-source transformation of C programs.

Whenever a node is registered, the SHAPE driver hardening agent pushes the code to that node. The complete command is executed through an SSH session. Once the hardening process is over, HHA reports the node's status to the manager. Figure 5 shows the working of HHA. Before hardening drivers, HHA generates replicas of the original drivers and log details for the drivers to be hardened. The hardening process first scans the source code of all the drivers and identifies the code where failures are likely. Once the code is identified, it replaces that code so as to harden the driver. After a driver is hardened, the original driver is replaced with the hardened one. A run-time monitoring HHA component is also added to every hardened driver. This monitoring component keeps track of the proper functioning of the driver. If any alerts are raised because of misbehavior of the driver, the hardened driver is replaced with the original driver and the manager node is updated. In such scenarios, manual intervention by the programmer is needed. Once the problem is resolved, the programmer manually resolves the alert and the hardened driver is deployed again. The fix-related information is kept in the database for future reference.


Fig. 5 Hardware hardening agent working

Chances that fault remains unnoticed are nearly negligible as proper logs are in-serted into driver during hardening process. This helps to track the exact behavior.Even if some issue goes unnoticed by HHA, it can be detected by HMA throughlogs.

Hardware monitor agent (HMA) [Algorithm 2] Once the drivers are hardened, allthe hardware components are then continuously monitored by the monitoring agent.SHAPE is using “machine checks logs” to handle hardware failures. A machinecheck is the hardware’s way to raise alert for internal errors. The monitoring agentuses machine check logs and does the entry of hardware related error logs into thecentralized database. For making the error info more meaningful, we use freewaretools—MCELogs (Linux) and MCat (windows) before entering the information inthe database. Both these tools use a good lexical analyzer to read the logs and do theentry in the database. Fields that SHAPE use from these logs include:

– Event Type: the severity of the event raised. SHAPE captures only logs with the event type “ERROR” or “CRITICAL.”
– Event ID: uniquely identifies the event.
– Source: the software that logged the event, which can be either a program name, such as “SQL Server,” or a component of the system or of a large program, such as a driver name. For example, “Elnkii” indicates an EtherLink II driver.
– Log: the name of the log file where the event was recorded.
– Time Stamp: the time at which the error event was raised.
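The field list above can be turned into a small filter. The following is a minimal sketch, assuming the parsed log entries are already available as records; the `LogEvent` class, the function name, and the sample values are hypothetical, and only the field names, the two severities, and the ‘HARDWARE Events’ log name come from the text and Algorithm 2.

```python
from dataclasses import dataclass

# Severities the text says SHAPE keeps; everything else is ignored.
CAPTURED_TYPES = {"ERROR", "CRITICAL"}


@dataclass
class LogEvent:
    """One parsed log entry (hypothetical container for the listed fields)."""
    event_type: str   # Event Type
    event_id: int     # Event ID
    source: str       # Source
    log: str          # Log
    time_stamp: str   # Time Stamp


def select_hardware_alerts(events, hardware_log="HARDWARE Events"):
    """Return the events that would be stored in the database and alerted on."""
    return [
        e for e in events
        if e.event_type in CAPTURED_TYPES and e.log == hardware_log
    ]


# Made-up sample entries, for illustration only.
events = [
    LogEvent("ERROR", 101, "Elnkii", "HARDWARE Events", "2013-01-01T10:00:00"),
    LogEvent("INFO", 102, "Elnkii", "HARDWARE Events", "2013-01-01T10:00:05"),
    LogEvent("CRITICAL", 103, "SQL Server", "Application", "2013-01-01T10:01:00"),
]
alerts = select_hardware_alerts(events)
```

Only the first sample entry passes both conditions: the second has the wrong severity and the third comes from a different log.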

Algorithm 2 Hardware Monitor Agents (HMA)
BEGIN
HdN = {N1, N2, N3, ..., Ni} is the set of hardened nodes
for all Node Nc : HdN do
    PARSE MCE Logs
    GET EVENT_TYPE, EVENT_ID, SOURCE, LOG, TIME_STAMP
    if (EVENT_TYPE Equals (‘ERROR’ or ‘CRITICAL’) && LOG Equals ‘HARDWARE Events’) then
        UPDATE database with this LOG information along with the NODE_NAME and MAC_ADDRESS
        Raise ALERT
    else
        IGNORE
    end if
end for

Hardware rebooting agent (HRA) [Algorithm 3] These agents reboot the system after getting input from the manager node. This is the only way in which SHAPE tries to recover the system automatically from hardware failures. HRA starts when the system starts and keeps listening as long as the system is in the ACTIVE state. Whenever it gets a RESTART request from the manager node, it restarts the node and, on restart, sends the updated node status to the manager. If the restart fails to recover from the failure, it raises an alert so that manual action can be taken to replace the faulty hardware.

Algorithm 3 Hardware Rebooting Agents (HRA)
BEGIN
WAIT for Analyzer INPUT to restart
RESTART the NODE
if NODE Equals ‘RESTARTED’ then
    UPDATE → NODE_STATUS = ‘ACTIVE’
else
    Raise ALERT
end if
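The restart handling of HRA can be sketched as a single function. This is an illustrative sketch only: the actual restart and the messaging to the manager are passed in as callables because the paper does not specify them, and the function and parameter names are assumptions.

```python
def handle_restart_request(node, restart, notify_manager):
    """Restart `node` and report the outcome to the manager.

    `restart` is a callable that returns True when the node came back up;
    `notify_manager` is called with (node, status). Both are placeholders
    for mechanisms the paper does not describe in detail.
    """
    if restart(node):
        notify_manager(node, "ACTIVE")
        return "ACTIVE"
    # The restart did not recover the node: ask for manual action.
    notify_manager(node, "ALERT")
    return "ALERT"


messages = []
record = lambda node, status: messages.append((node, status))
status_ok = handle_restart_request("N1", lambda node: True, record)
status_bad = handle_restart_request("N2", lambda node: False, record)
```

Passing the restart action as a parameter keeps the decision logic testable without rebooting a real machine.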

Software agents Most distributed systems focus on hardware fault tolerance, and software fault tolerance is often overlooked. This is surprising because hardware components have much higher reliability than the software that runs on them. Most system designers go to great lengths to limit the impact of a hardware failure on system performance, but pay little attention to the system’s behavior when a software module fails. SHAPE provides agents to check for software failures.

Memory agent (MA) [Algorithm 4] This agent updates the CPU and memory usage of the nodes involved in job execution. A cron job runs in the background and monitors the node’s CPU and memory usage. If the memory consumption or CPU usage exceeds the threshold value, an alert is raised. The threshold is configured automatically: the manager node sets it after analyzing the average load on the environment. For example, if it is set to 70 %, an alert is raised whenever CPU or memory consumption crosses this value. During peak working loads, such as a cricket score broadcasting website during the World Cup, the threshold can go up to 90 % to get full utilization of resources. When most of the resources are idle, it decreases.

Algorithm 4 Memory Agents (MA)
BEGIN
Monitor CPU and MEMORY of NODE
if (STATUS{CPU, MEMORY} > Threshold Value) then
    Raise ALERT
    UPDATE → CPU and RAM INFO
end if
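The threshold check of the memory agent is simple enough to sketch directly. The 70 % and 90 % values come from the example in the text; the rule in `adjust_threshold` and its `peak_load` parameter are hypothetical, since the paper does not say how the manager derives the threshold from the average load.

```python
def usage_alerts(cpu_percent, memory_percent, threshold=70.0):
    """Return the names of the resources whose usage exceeds the threshold."""
    readings = {"CPU": cpu_percent, "MEMORY": memory_percent}
    return [name for name, value in readings.items() if value > threshold]


def adjust_threshold(average_load, low=70.0, high=90.0, peak_load=80.0):
    """Pick a threshold from the average load (assumed two-level rule)."""
    return high if average_load >= peak_load else low


normal_alerts = usage_alerts(85.0, 40.0)                                 # default 70 % threshold
peak_alerts = usage_alerts(85.0, 95.0, threshold=adjust_threshold(88.0))  # raised to 90 %
```

With the raised peak threshold, the 85 % CPU reading no longer triggers an alert, while the 95 % memory reading still does.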

Application fault tolerant agent (AFTA) These agents check for failures at the application level. The agent is optional and can be disabled completely through configuration. SHAPE suggests wrapping the job in a wrapper that helps the AFTA monitor the job execution. The wrapper logs the start and end time of each module of the submitted job, logs other details such as the job id and expected finish time, and supports additional logging levels [VERBOSE, DEBUG, INFO, ERROR]. If the application takes longer than the expected finish time, an alert is logged.
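A job wrapper of the kind described above might look like the following sketch. It assumes a job is a list of named callables and uses Python’s standard `logging` module; the function name, the log message format, and the injectable `clock` parameter are illustrative choices, not the authors’ implementation.

```python
import logging
import time

logger = logging.getLogger("afta")


def run_wrapped(job_id, modules, expected_seconds, clock=time.monotonic):
    """Run each (name, callable) module of a job and log its timings.

    Returns True if the job overran `expected_seconds`, i.e. an alert was logged.
    """
    job_start = clock()
    for name, module in modules:
        start = clock()
        module()
        logger.info("job=%s module=%s start=%.3f end=%.3f",
                    job_id, name, start, clock())
    elapsed = clock() - job_start
    if elapsed > expected_seconds:
        logger.error("job=%s exceeded expected finish time: %.3fs > %.3fs",
                     job_id, elapsed, expected_seconds)
        return True
    return False


# A fake clock makes the behaviour reproducible: the job appears to take 5 s.
ticks = iter([0.0, 0.0, 1.0, 5.0])
overran = run_wrapped("job-1", [("parse", lambda: None)], 3.0,
                      clock=lambda: next(ticks))
```

Injecting the clock is a design choice for testability; in normal use the default monotonic clock measures real elapsed time.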

Software rebooting agent (SRA) These agents work like HRA. The only difference is that they restart the application that caused the alert, not the hardware.

Network agents (NA) [Algorithm 5] NAs detect and log network-related failures. SHAPE uses a pull-based heartbeat model to update the node status at the manager of that AU. In the pull-based model, the master node periodically asks the child nodes for their status to check that they are alive. In the absence of a response, the master assumes that some failure has occurred at that node. To make the process more efficient, SHAPE keeps the status update period (P1) small for the nodes participating in job execution. Nodes that are idle fall into a second segment with period P2, which is longer than P1, so they are polled less often (Algorithm 5 uses 5 s for working nodes and 90 s for idle nodes).
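The two-period pull model can be sketched as a polling decision per node. The 5 s and 90 s periods are the timer values given in Algorithm 5; the data layout, the function names, and the `get_status` callable (which returns None when a node does not answer) are assumptions made for illustration.

```python
def due_for_poll(last_polled, now, working, p1=5.0, p2=90.0):
    """Return True when the manager should ask this node for its status."""
    period = p1 if working else p2   # working nodes use the short period P1
    return now - last_polled >= period


def poll_nodes(nodes, now, get_status):
    """Poll every node that is due; return the names of nodes that did not answer."""
    failed = []
    for name, info in nodes.items():
        if due_for_poll(info["last_polled"], now, info["working"]):
            info["last_polled"] = now
            if get_status(name) is None:   # no response counts as a failure
                failed.append(name)
    return failed


nodes = {
    "N1": {"last_polled": 0.0, "working": True},
    "N2": {"last_polled": 0.0, "working": False},
    "N3": {"last_polled": 0.0, "working": True},
}
failed = poll_nodes(nodes, 10.0, lambda name: None if name == "N3" else "ACTIVE")
```

Ten seconds after the last poll, the two working nodes are due and the idle node N2 is skipped, which is the saving the two periods are meant to provide.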

NA also measures how much data can be sent on or through a network resource in a given time period, and the time a packet takes to traverse a network, network segment, or network device, either one way or round trip. SHAPE logs many extra network details that are not currently used during the analysis phase. We expect this information to be helpful in the near future for detailed analysis and report generation.

4.2.2 Security agents (SA)

SAs [Algorithm 6] check the system for various known and unknown attacks. An SA captures all new anomalies and logs the details into the database. Table 1 shows the attacks from which SHAPE aims to protect the system.

Algorithm 5 Network Agents (NA)
BEGIN
int TIMER_SHORT = 5 Sec
int TIMER_LONG = 90 Sec
List IDLE_NODES Nid = {N1, N2, ..., Nn}
List WORKING_NODES Nwr = {N′1, N′2, ..., N′n}
while true do
    for all IDLE_NODES do
        Get node Status
        Thread.sleep(TIMER_LONG)
    end for
    for all WORKING_NODES do
        Get node Status
        Thread.sleep(TIMER_SHORT)
    end for
end while

Algorithm 6 Security Agents (SA)
BEGIN
Capture Packets
Parse captured packets
for all Packets do
    if Packet != Profile(MIN, MAX) range then
        LOG details in log file.
    end if
end for

SHAPE uses the Snort anomaly detector [40] to self-protect the system from security attacks. Snort has been tuned for integration with SHAPE. Security agents run on each node participating in the grid and log the details in the database on the manager node of that AU (Table 1). This is done using a preprocessor, which reads the log and represents each data instance as a vector of real numbers. This preprocessing is mandatory for the detection engine that is used. The next function of the security agent is to raise an alert. It reads the predicted pattern of packets reaching the network (the “Network Profile”) and compares it with the packets captured earlier. It then logs an alert when the current value falls outside the “minimum” to “maximum” range for that time.
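The range comparison against the network profile can be sketched as follows. The representation is an assumption: the profile is taken to be a mapping from a time slot to a (minimum, maximum) packet count, and the observed traffic a mapping from the same slots to counts. The paper does not specify either structure, and the numbers are made up.

```python
def out_of_profile(observed, profile):
    """Return the time slots whose observed value lies outside [min, max]."""
    anomalies = []
    for slot, count in observed.items():
        low, high = profile[slot]
        if not (low <= count <= high):
            anomalies.append(slot)
    return anomalies


# Hypothetical profile and observations, for illustration only.
profile = {"09:00": (100, 500), "10:00": (200, 800)}
observed = {"09:00": 450, "10:00": 1200}
anomalous_slots = out_of_profile(observed, profile)
```

Each slot returned here corresponds to one alert that the security agent would log.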

SHAPE uses a Support Vector Machine (SVM) as the network profile. SVM is a widely used supervised learning technique, introduced in 1992 by Boser, Guyon, and Vapnik [41]. It interprets data and recognizes patterns in them. A model is built from the training data, and this model predicts the target values of the test data given only the test data attributes. For every given input, the SVM predicts which of the possible classes will be the output.

Table 1 List of attacks that SHAPE deals with

Attack class            Attack name           Description
DoS                     1. Smurf              Attacks that disrupt a host or a network service so that
                        2. Neptune            legitimate users cannot use that network
                        3. Land
                        4. Teardrop
R2L (Remote to Local)   5. Guess password     Unauthorized attackers gain local access from a remote
                        6. IMAP               machine and then exploit that network
                        7. SPY
U2R (User to Root)      8. Buffer overflow    Local users get root access without authorization and
                        9. Rootkits           then exploit the network
Probing                 10. NMAP              Attackers use programs to automatically scan the network
                        11. Ports sweep       to gather information or find known vulnerabilities

The preprocessor used in SHAPE produces numeric output, which serves as input to the SVM-based detection engine. The input data is scaled so that attributes in larger numeric ranges do not dominate those in smaller numeric ranges. Scaling also avoids the numerical difficulties that large attribute values cause during the calculations. Each feature is scaled to the range [−1,+1] or [0,1]. The same method is used to scale both the training data and the testing data. Out of the four commonly used kernels, we chose the radial basis function (RBF) kernel because it can handle the case where the relation between class labels and attributes is nonlinear. Additionally, it has fewer numerical difficulties. To increase the accuracy with which the classifier predicts the output for unknown testing data, the best values of the parameters (C, gamma) are selected by cross-validation. In v-fold cross-validation, we first divide the training set into v subsets of equal size. Each subset in turn is tested using the classifier trained on the remaining v − 1 subsets. Thus, each instance of the whole training set is predicted once, and the cross-validation accuracy is the percentage of data that are correctly classified. The parameter values obtained are then used to train the model on the training set. The result is a model file, which is used to classify the testing data. The parameters used to scale the training data are saved and reused when scaling the testing data. After data scaling is complete, testing is done on the scaled data. The output file then contains the predicted labels for the testing data.
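Two steps of this pipeline, scaling with saved parameters and the v-fold split, can be illustrated without any SVM library. This is a plain-Python sketch under stated assumptions: the function names are hypothetical, the interleaved fold assignment is one arbitrary way to form the subsets, and the folds have equal size only when v divides the number of instances.

```python
def fit_scaler(rows):
    """Record the per-feature (min, max) of the training rows."""
    columns = list(zip(*rows))
    return [(min(c), max(c)) for c in columns]


def scale(rows, params, low=-1.0, high=1.0):
    """Map each feature to [low, high] using the saved training parameters."""
    scaled = []
    for row in rows:
        out = []
        for value, (lo, hi) in zip(row, params):
            if hi == lo:                      # constant feature: use the midpoint
                out.append((low + high) / 2)
            else:
                out.append(low + (high - low) * (value - lo) / (hi - lo))
        scaled.append(out)
    return scaled


def v_fold_indices(n, v):
    """Yield (train_indices, test_indices) for v folds over n instances."""
    folds = [list(range(i, n, v)) for i in range(v)]
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test


train_rows = [[0, 10], [5, 20], [10, 30]]
params = fit_scaler(train_rows)               # saved once, from training data only
scaled_train = scale(train_rows, params)
scaled_test = scale([[5, 30]], params)        # testing data reuses the same params
```

Reusing `params` for the testing data is the point made in the text: fitting a second scaler on the test set would map the same raw value to different scaled values in training and testing.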

For example, consider a string input to the SVM. First, it is parsed. The parsed string is then converted to numeric form inside the preprocessor. The training set available to the SVM consists of numeric data in the same format as this converted input, but with an output label attached to each string. The closest match between the input to the SVM and the data in the training set is selected, and the output is predicted on the basis of this closest match.

Suppose the input string is in the binary form:

1:1 2:0 3:1 4:1 5:1 6:0 7:1 8:0

The training set contains the following strings:

0.1 1:1 2:0 3:1 4:1 5:0 6:0 7:0 8:0
0.2 1:0 2:0 3:1 4:0 5:0 6:0 7:1 8:1
0.3 1:1 2:0 3:1 4:0 5:1 6:1 7:0 8:0
0.4 1:1 2:1 3:1 4:1 5:1 6:0 7:1 8:0

Although the training set does not contain an exact match, the SVM still predicts the output on the basis of the closest match, i.e., the output is the label for which the maximum number of characteristics match. Here the string labeled 0.4 agrees with the input in seven of the eight positions, more than any other training string, so the output predicted by the SVM is 0.4.
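The closest-match reasoning in this example can be checked by counting matching attributes. The sketch below reproduces only the simplified intuition used in the text (it is a nearest-match count, not an actual SVM decision function), and the helper names are hypothetical.

```python
def parse(line):
    """Parse '[label] 1:v 2:v ...' into (label or None, [values])."""
    tokens = line.split()
    label = None
    if ":" not in tokens[0]:
        label = tokens.pop(0)
    return label, [int(t.split(":")[1]) for t in tokens]


def closest_label(query, training):
    """Return (label, matches) of the training string sharing most attributes."""
    _, q = parse(query)
    best_label, best_score = None, -1
    for line in training:
        label, values = parse(line)
        score = sum(a == b for a, b in zip(q, values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


query = "1:1 2:0 3:1 4:1 5:1 6:0 7:1 8:0"
training = [
    "0.1 1:1 2:0 3:1 4:1 5:0 6:0 7:0 8:0",
    "0.2 1:0 2:0 3:1 4:0 5:0 6:0 7:1 8:1",
    "0.3 1:1 2:0 3:1 4:0 5:1 6:1 7:0 8:0",
    "0.4 1:1 2:1 3:1 4:1 5:1 6:0 7:1 8:0",
]
result = closest_label(query, training)   # ("0.4", 7)
```

The match counts for the four training strings are 6, 4, 5, and 7, so the label 0.4 is selected.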

In addition, the C and gamma values directly influence the accuracy of the SVM. A smaller value of C allows points close to the boundary to be ignored, which increases the margin at the risk of underfitting. When C is large, the variance increases (the classifier tries to fit the training data as closely as possible), with a risk of overfitting. For small values of gamma, the decision boundary is nearly linear. As gamma increases, the flexibility of the decision boundary increases, and a large value of gamma leads to overfitting.
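For reference, the roles of the two parameters can be read off the standard textbook forms of the RBF kernel and the soft-margin objective. These are the usual definitions rather than formulas stated in this paper; the symbols w, b, and the slack variables ξ_i are the conventional ones.

```latex
% RBF kernel: gamma controls how quickly similarity decays with distance
K(x_i, x_j) = \exp\left(-\gamma \,\lVert x_i - x_j \rVert^{2}\right), \qquad \gamma > 0

% Soft-margin SVM: C weighs training errors against margin width
\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{l} \xi_i
\quad \text{subject to} \quad y_i\left(w^{\top}\phi(x_i) + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0
```

A small C makes the slack term cheap, so boundary points may be misclassified in exchange for a wider margin; a large gamma makes the kernel very local, which lets the boundary bend around individual training points.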

4.3 Analyze and plan

After the monitor logs all the alerts, the analysis and plan component [Algorithm 7] starts analyzing those logs to extract meaningful information and plan the proper action for each alert. During the initial phase (or training phase), SHAPE takes some time to build up information about the nodes. Once the data is ready, SHAPE automatically handles failures and security attacks in the ways discussed below. Alerts fall into four main categories: hardware alert logs, software alert logs, network alert logs, and security alert logs.

Hardware alert logs Based on the alerts raised by HMA, the analysis unit analyzes the hardware behavior. The more alerts are logged for a particular node, the lower the probability that the node is chosen for job execution when another option is available. If a node N1 has 10 alert instances attached to its name and N2 has 3, N2 is considered the more stable node. In this way, a priority list is defined for all nodes participating in the network. If a hardware alert is raised for a node NC at run time while a job is executing on it, then depending on the severity of the alert, the job is resubmitted to another node NC1 and a plan of action is defined to restart NC through HRA.
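The priority list described here amounts to ordering nodes by their alert counts. A minimal sketch, assuming the counts are available as a mapping from node name to number of logged alerts (the function name and the tie-break by name are illustrative choices):

```python
def priority_list(alert_counts):
    """Order nodes from most to least stable (fewest alerts first)."""
    return sorted(alert_counts, key=lambda node: (alert_counts[node], node))


ranking = priority_list({"N1": 10, "N2": 3})   # the example from the text
```

With the counts from the text, N2 is ranked ahead of N1 and would be preferred for job execution.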

Software alert logs Software logs give the current picture of the software running on different nodes. This includes the memory usage of the application, the CPU usage of the application, and the job execution status. If jobs take longer than expected, the status of every node participating in the job execution is verified to check whether some problem exists. This check is based on all hardware, software, and network logs.

Algorithm 7 Analyzing Unit (AU)
#Process Logs
if NETWORK_ERROR then
    #Check for Hardware Errors
    if HW_ERROR then
        for NC where EVENT_TYPE == ‘CRITICAL’ ‖ ‘ERROR’ do
            SET STATUS_NC = ‘DOWN’
            Send MSG(HRA, NC, RESTART)
            WAIT for status from HRA
            if STATUS_NC != ‘ACTIVE’ then
                RAISE ALERT → EMAIL to Admin
            end if
        end for
        #Check for Software Errors
        if SW_ERROR then
            for all NC(MEMORY ‖ CPU ‖ TIME_COMPLETION) > THRESHOLD do
                SET STATUS_NC = ‘DOWN’
                Send MSG(SRA, NC, RESTART)
                WAIT for status from SRA
                if STATUS_NC != ‘ACTIVE’ then
                    RAISE ALERT → EMAIL to Admin
                end if
            end for
        end if
    end if
    #Check for the NW Failures
    String[] traceResult = traceRoute(endpoint); //call traceroute
    for all traceResult do
        PRINT traceResult[i]
        if traceResult[i] == ‘FAILURE’ then
            Raise ‘ALERT’
        end if
    end for
end if
#Check for the Security Attacks
Gather all new Alerts raised Al
for all Al do
    Parse Alerts to get information about Port, URL and Payload.
    Group data based upon the above gathered information.
    Apply LCS to get largest common substring.
    Use this string as payload string to construct new signature.
end for

Fig. 6 Signature generation

Network alert logs These logs give the current status of the network and thus help in taking correct decisions to reduce the failure rate.

Based on the hardware, software, and network logs, the current picture of the complete environment is maintained in a separate table in the database. This helps in dealing with failures of submitted jobs from the start until they complete their computation, and reduces the failure rate of the application. At present, once a failure is detected, SHAPE uses only resubmission and restarting of hardware/software to deal with failures automatically. However, SHAPE is built on a component-based modular architecture in which any component can be added, modified, or removed without affecting the others. This is one of the biggest strengths of SHAPE: it provides a generic architecture that can be adapted easily to the need at hand.

Security alert logs These logs contain the security breach alerts raised by the different security agents. Based on the analysis, a new signature is generated and configured automatically to prevent any future occurrence of the same attack. This hastens the analysis process.

To analyze the security alert logs, we used a Java-based program to parse the log file. A common token-subsequence signature for a set of parsed logs is found by applying the longest common subsequence (LCSeq) algorithm to all signatures for the flows contained in this set. The process is shown in Fig. 6.

Steps to generate signatures are summarized as:

– Collect security alerts from all the AEs participating in the AU.
– Parse the alerts using a Java utility. This is a custom utility written to parse the alert log file.
– Group the parsed data by ports, URLs, and payload (if available).
– Apply the LCS algorithm on the payload of interest to get the common payload.
– Save the result in the signature database.
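The LCS step in the list above can be sketched with the classic dynamic-programming algorithm. The sketch assumes that payloads are split into whitespace-separated tokens and that the common payload of a group is obtained by folding the pairwise longest common subsequence over the group. Both are simplifying assumptions: pairwise folding is a heuristic rather than an exact multi-sequence LCS, and the sample payloads are made up.

```python
from functools import reduce


def lcs(a, b):
    """Longest common subsequence of two token lists (dynamic programming)."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover the common tokens.
    out, i, j = [], rows, cols
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def common_signature(payloads):
    """Fold pairwise LCS over the tokenised payloads of one alert group."""
    return reduce(lcs, (p.split() for p in payloads))


# Hypothetical payloads that were grouped under the same port and URL.
group = [
    "GET id=1 UNION SELECT passwd",
    "GET id=7 UNION ALL SELECT passwd",
    "POST GET x UNION SELECT y passwd",
]
signature_tokens = common_signature(group)
```

The tokens that survive the fold are the ones shared, in order, by every payload in the group; they form the candidate payload string for a new signature.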

4.4 Execute

Once the analysis of logs is done, the role of the executor is to act on the analyzed information [Algorithm 8]. For self-healing, the executor’s main goal is to reduce the failure rate of job execution. Based on the information available from the analyzer, the executor checks each new job submission so that jobs are not submitted to nodes that are already faulty. When a node executing a job reports an error, the executor saves the current state of the job, restarts the node, and raises an alert if the restart does not resolve the issue.

For self-protection, the “analysis signatures” generated by the analyzer are further refined and finalized for use as Snort signatures. The analysis signatures are compared with the existing signatures in the Snort database. If they are a subset of existing signatures, they are merged; if they are new, they are simply added to the Snort signature database.

Algorithm 8 Executor
Self-Healing Execution
if New JOB_SUBMISSION then
    if FAULT_NODE_LIST.contains(NODE_SELECTED) then
        Select different node
    end if
end if
if JOB_EXECUTION_NODE == ‘ERROR’ then
    Backup data (for this checkpointing can be added)
    Depending upon failure type, Send RESTART message to Restart agent
end if
Self-Protection Execution
for all Analysis Signature A_SIG do
    if A_SIG Already Exist then
        IGNORE!
    else if A_SIG ⊂ Existing Content then
        MERGE Signature to existing
    else
        Add as new.
    end if
end for
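The two executor decisions can be sketched as small functions. This is a simplified illustration, not the Snort integration itself: signatures are modeled as plain payload strings, the “subset” test is approximated by a substring check, and a merged signature is treated as already covered by the existing rule. All names are hypothetical.

```python
def select_node(candidates, fault_node_list):
    """Return the first candidate that is not on the fault list, else None."""
    for node in candidates:
        if node not in fault_node_list:
            return node
    return None


def update_signatures(existing, analysis_signature):
    """Apply the ignore / merge / add rule; return the action taken."""
    if analysis_signature in existing:
        return "ignore"                     # exact duplicate
    if any(analysis_signature in signature for signature in existing):
        return "merge"                      # substring stands in for "subset"
    existing.append(analysis_signature)
    return "add"


signatures = ["UNION SELECT passwd"]
actions = [
    update_signatures(signatures, "UNION SELECT passwd"),
    update_signatures(signatures, "SELECT passwd"),
    update_signatures(signatures, "cmd.exe"),
]
chosen = select_node(["N1", "N2"], {"N1"})
```

Only the third candidate is genuinely new, so it is the only one appended to the signature list.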

4.5 Effector

The effector acts as the base for all communication between AEs and AUs. Its main role is to transfer rules, policies, and alerts to different nodes.

5 Implementation and results

In order to assess the overall performance of SHAPE in a realistic scenario, a prototype of the proposed architecture was implemented using Oracle Java SDK version 6, the well-known JADE platform for agents, the Eclipse IDE, a MySQL database, and Snort as an anomaly detector. SHAPE was tested on a grid environment set up using the Globus toolkit. Globus toolkit 4.0 was installed on all resource providers, and the services they provide were deployed as grid services. Along with Globus, SHAPE agents are placed on each node to automatically handle faults and security attacks. The grid environment consists of 44 Intel Dual Core 2.2 GHz Windows XP nodes, 20 dual 2.4 GHz Xeon Linux nodes, and 5 dual 450 MHz PII Linux cluster nodes. Each node has 1 GB RAM and an 80 GB HDD (Table 2). The TU grid is exposed to the outer world with limited access. We believe that SHAPE will work for any distributed system as its architecture is not middleware specific. These systems are further categorized into three AUs for experimentation purposes, though there is no hard rule limiting the number of AEs in an AU. We have verified SHAPE in two parts: self-healing and self-protection.

Table 2 SHAPE environment details

Configuration              OS        Number   AU
Intel Dual Core 2.2 GHz    Windows   44       AU1, AU2
2.4 GHz Xeon               Linux     20       AU3
450 MHz PII                Linux     5        AU3

Table 3 Self-healing metrics

Self-healing metric   Description
Throughput            The number of jobs executed in a given time
Turnaround time       The interval between job submission and job completion
Waiting time          The amount of time a job has to wait before its execution starts
Failure rate          The percentage of failures detected by the system

5.1 Self-healing verification

To verify self-healing, we conducted experiments that vary the number of jobs submitted and the percentage of faults injected. Based on these experiments, we calculated the average throughput, average turnaround time, average waiting time, and average number of failures detected. The standard metrics used to verify the SHAPE self-healing feature are listed in Table 3 and described below.

Throughput Throughput is one of the most important standard metrics used to measure the performance of a fault-tolerant system.

\[ \mathrm{Throughput}(n) = \frac{n}{T_n}, \]

where n is the total number of jobs and Tn is the total amount of time necessary to complete n jobs.

In general, the average throughput decreases as the percentage of faults injected and the number of jobs submitted increase. Figures 7 and 8 show the average throughput with and without the SHAPE component in place. When the SHAPE self-healing module handles faults, more jobs are executed per hour than in the normal scenario. Figure 7 shows the throughput when 1000 jobs are executed with a varying fault percentage. As faults increase, the average throughput decreases but always remains higher than that of the system without self-healing. As the number of jobs increases, the throughput advantage of the SHAPE model over the normal system keeps growing.

Fig. 7 Throughput vs. fault percentage (1000 jobs)

Fig. 8 Throughput vs. fault percentage (5000 jobs)

Turnaround time From the point of view of a particular job, the important criterion is how long the system takes to execute that job. The interval from the time of submission of a job to the time of completion is the turnaround time.

\[ T_a(n) = \frac{1}{n} \sum_{k=1}^{n} \left( T_{c_k} - T_{s_k} \right), \]

where n is the number of jobs, Ta is the turnaround time, Tc is the completion time, and Ts is the submission time.

Figures 9 and 10 show Ta for SHAPE and for the system without any self-healing deployed. We tested the system with 1000 and 5000 jobs at different fault percentages. As the number of failures in the system increases, Ta increases. Ta is lower with SHAPE than in the normal scenario because SHAPE handles all failures automatically, which speeds up execution. Another reason is the good selection of nodes when job execution is about to start. This selection is based on the information gathered about each node by the different SHAPE agents. The analyzing unit of SHAPE analyzes this information further and updates the status of each resource in the central database. Whenever a new job is about to start, resources are allocated according to the current status of the nodes in the database.

Fig. 9 Turnaround time vs. fault percentage (1000 jobs)

Fig. 10 Turnaround time vs. fault percentage (5000 jobs)

Unavailability/waiting time Waiting time, in simple words, is the delay in the start of job execution. It is the average amount of time between submitting the job to the system and starting its execution on one of the execution nodes.

\[ W_t(n) = \frac{1}{n} \sum_{k=1}^{n} \left( T_{e_k} - T_{s_k} \right), \]

where n is the number of jobs, Wt is the waiting time, Te is the time at which execution starts, and Ts is the submission time.
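The three timing metrics defined so far can be computed from per-job time stamps. A minimal sketch, assuming each job is recorded as a mapping with `submitted`, `started`, and `completed` times in the same unit; the record layout and the sample values are hypothetical.

```python
def throughput(n_jobs, total_time):
    """Throughput(n) = n / Tn."""
    return n_jobs / total_time


def average_turnaround(jobs):
    """Ta(n): mean of (completion time - submission time)."""
    return sum(j["completed"] - j["submitted"] for j in jobs) / len(jobs)


def average_waiting(jobs):
    """Wt(n): mean of (execution start time - submission time)."""
    return sum(j["started"] - j["submitted"] for j in jobs) / len(jobs)


# Two made-up jobs with times in seconds.
jobs = [
    {"submitted": 0, "started": 2, "completed": 10},
    {"submitted": 1, "started": 5, "completed": 9},
]
ta = average_turnaround(jobs)   # (10 + 8) / 2
wt = average_waiting(jobs)      # (2 + 4) / 2
tp = throughput(len(jobs), 10)  # 2 jobs in 10 s
```

Since Te lies between Ts and Tc for every job, the waiting time computed this way can never exceed the turnaround time.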

Figures 11 and 12 show Wt with and without SHAPE for different percentages of faults injected and different numbers of jobs executed. Wt increases as either of them increases. The SHAPE-based model has a lower Wt than the normal system because SHAPE categorizes nodes in advance according to their current state. This brings the execution start time Te forward and thus decreases the waiting time.

Failure tendency Failure tendency represents the percentage of failures detected by the system. Figure 13 shows the failure tendency of SHAPE. SHAPE works for three kinds of failures: software, network, and hardware. Tests are performed with a constant number of failures introduced for different numbers of jobs. The tests show that SHAPE keeps detecting the different types of failures irrespective of the number of jobs submitted. The best fault detection rate is for hardware failures.

\[ F_t = \frac{\sum \text{Failure Type}}{\text{Total number of failures}}, \]

where Failure Type is one of Hw (hardware failures), Sw (software failures), and Nw (network failures).

Fig. 11 Waiting time vs. fault percentage (1000 jobs)

Fig. 12 Waiting time vs. fault percentage (5000 jobs)

Fig. 13 Jobs submitted vs. percentage failure detected

5.2 Self-protection verification

To show how attackers can take advantage of security vulnerabilities in computer systems or software services, attacks in each category are simulated using different tools. For DDoS attacks, the Metasploit framework is used to launch smurf, land, Neptune, and teardrop attacks. For R2L, Hydra is used for password guessing and spying. NetCat is used to simulate U2R attacks. NMap is used for network probing. To evaluate the efficiency of SHAPE self-protection, we considered the following metrics (Table 4).

Table 4 Self-protection metrics

Self-protection metric           Description
Detection rate                   The number of attacks detected and blocked, relative to the total number of intrusions
False positive rate              The number of normal instances incorrectly considered attacks, relative to all normal instances
Signature generation frequency   The number of signatures generated in a certain period

Fig. 14 Detection rate vs. attacks [known attacks]

Detection rate The detection rate DR is the proportion of intrusions that are detected and blocked:

\[ DR = \frac{\text{Total No. of True Positives}}{\text{Total No. of intrusions}}. \]

SHAPE’s detection rate keeps increasing with time. Once the system is past the training period, its capability to detect new attacks keeps growing. For every new attack or intrusion detected, the signature database is updated with new rules to prevent the same attack or intrusion from recurring. When most of the attacks are already known, the detection rate with and without SHAPE is nearly the same.

Figure 14 shows that when the attacks on the network are already known and their signatures exist in all three systems, the detection rate of all the approaches is nearly the same. Figure 15 shows that when more unknown attacks are fired at the network, SHAPE is more effective than even Snort as an anomaly detector. To verify this, we removed a few known attack signatures from all three systems and then launched those attacks on the network.

When more unknown attacks are added to the system, Snort as a signature-based system fails to detect them. The detection rate of SHAPE keeps increasing with time (Fig. 16). In the first weeks after SHAPE is deployed, the detection rate is low because little has been learned. As time passes, the detection capabilities of SHAPE increase. We verified this by passing a combination of known and unknown attacks in the packets sent to the security agent. After week 8, the detection rate of SHAPE starts improving.

False positive rate The false positive rate is the proportion of normal instances that were incorrectly considered attacks. Mathematically, FPR = FP/(TN + FP), where FP = False Positives and TN = True Negatives.
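Both self-protection ratios follow directly from their definitions. The sketch below simply evaluates the two formulas; the counts are made-up inputs, not results from the paper.

```python
def detection_rate(true_positives, total_intrusions):
    """DR = true positives / total intrusions."""
    return true_positives / total_intrusions


def false_positive_rate(false_positives, true_negatives):
    """FPR = FP / (TN + FP)."""
    return false_positives / (true_negatives + false_positives)


dr = detection_rate(45, 50)          # 45 of 50 intrusions detected
fpr = false_positive_rate(5, 95)     # 5 of 100 normal instances flagged
```

The denominator of the FPR counts only normal instances (true negatives plus false positives), so it is independent of how many attacks were launched.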

Fig. 15 Detection rate vs. attacks [unknown attacks]

Fig. 16 Detection rate vs. time

Fig. 17 False positive rate vs. time

In SHAPE, the FPR keeps decreasing with the passage of time. Figure 17 shows the different categories of attacks and their false positive rates. With time, the rules auto-generated by SHAPE become more stable, and thus the rate starts coming down. For DDoS attacks, the FPR is a little higher because these attacks have many more variants than the other attacks.

Signature generation frequency Signature generation frequency is the percentage of signatures generated over time. Figure 18 shows the signature generation frequency of SHAPE with respect to Snort. In SHAPE, signature generation is an automatic process, so the number of signatures generated depends on the network analysis. In Snort, on the other hand, there are only manual updates.

Fig. 18 Signature generation frequency

6 Conclusions

As the complexity of distributed networks increases, they become more unreliable because they involve a large number of heterogeneous resources located in different geographical domains. For these types of networks, secure and fault-tolerant resource allocation services have to be provided. This paper listed the most common security threats and failures that most present distributed systems encounter. A new approach called SHAPE is presented that helps to deal with failures and security threats with the least manual administrative intervention. SHAPE is an agent-based approach built upon the autonomic computing architecture. It involves various autonomic elements communicating with each other to self-protect the system from attacks and failures. The autonomic element consists of sensors, monitors, analyzer, planner, executor, and effector. SHAPE is implemented purely using open source technologies, and results are provided to support the research. The results show that SHAPE reduces the failure rate and increases the efficiency of the system by executing a higher number of jobs per hour. It is also capable of detecting new anomalies and then automatically generating signature definitions to block their recurrence. Work on SHAPE is still in progress.

Future works include:

– Performance Optimization: Increase the performance of the system by further fine-tuning the algorithms used.
– Component Level Testing: Test the system by adding checkpointing, replication, and scheduling-based approaches to the AE.
– Standard Platform: Make SHAPE the standard platform for self-healing and self-protection in distributed networks. Web-service-based APIs are to be exposed to access the functionality of SHAPE from anywhere.
– Reports: Include an extensive report generation engine to generate the complete network details at the line and detail level.

References

1. Foster I, Kesselman C, Nick JM, Tuecke S (2004) The physiology of the grid. Global Grid Forum2. Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A,

Stoica I, Zaharia M (2010) A view of cloud computing. Commun ACM 53(4):50–58

612 I. Chopra, M. Singh

3. Sun W, Zhang X, Guo CJ, Sun P, Su H (2008) Software as a service: configuration and customizationperspectives. In: Congress on services part II, SERVICES-2, 23–26 Sept. IEEE Press, New York

4. Foster I, Zhao Y, Raicu I, Lu S (2008) Cloud computing and grid computing 360-degree compared.In: IEEE grid computing environments GCE08

5. Chopra I, Singh M (2010) Analysing the need for autonomic behaviour in grid computing. In: Com-puter and automation engineering (ICCAE). IEEE Press, New York, pp 535–539

6. Ganek AG, Corbi TA (2003) The dawning of the autonomic computing era. IBM Syst J 42(1):5–187. Psaier H Dustdar S (2011) A survey on self-healing systems: approaches and systems. Computing.

Springer, Berlin8. Ghosh D, Sharman R, Raghav Rao H, Upadhyaya S (2006) Self-healing systems—survey and syn-

thesis. Elsevier, Amsterdam9. Parashar M, Hariri S (2005) Autonomic computing: an overview Springer, Berlin, pp 247–259

10. Garg R, Singh AK (2011) Fault tolerance in grid computing: state of art and open issues. Int J ComputSci Eng Surv IJCSES

11. Hoffmann H, Eastep J et al (2010) Application heartbeats: a generic interface for specifying programperformance and goals in autonomous computing environments. In: Proceeding of the 7th interna-tional conference on autonomic computing, ICAC. New York, NY, USA. ACM, New York, 7988 pp

12. Maggio M, Hoffmann H, Leva A (2010) Controlling software applications via resource allocationwithin the heartbeats framework In: Proceeding of the 49th international conference on decision andcontrol, Atlanta, USA. IEEE Press, New York

13. Maggio M, Hoffmann H, Santambrogio MD, Agarwal A, Leva A (2011) A comparison of autonomic decision making techniques. Computer science and artificial intelligence laboratory technical report, MIT

14. Amoon M (2011) A development of fault-tolerant and scheduling system for grid computing. GESJ, Comput Sci Telecommun

15. Zhao L, Ren Y, Xiang Y, Sakurai K (2010) Fault-tolerant scheduling with dynamic number of replicas in heterogeneous systems. In: High performance computing and communications (HPCC), pp 434–441

16. Claudel B, De Palma N, Lachaize R, Hagimont D (2006) Self-protection for distributed component based applications. Springer, Berlin

17. Networks A (2012) Arbor special report: worldwide infrastructure security report, vol VIII

18. Varalakshmi P, Thamarai Selvi S (2013) Thwarting DDoS attacks in grid using information divergence. Future Gener Comput Syst

19. Cappelli D, Moore A, Trzeciak R (2012) The CERT guide to insider threats: how to prevent, detect, and respond to information technology crimes (theft, sabotage, fraud). SEI series in software engineering. Addison–Wesley, Reading, 28 pp

20. Humphrey M, Thompson M (2002) Security implications of typical grid computing usage scenarios. Cluster computing, vol 5, issue 3, July

21. Kendall K (1999) A database of computer attacks for the evaluation of intrusion detection systems, June

22. McAfee (2006) Archived from the original on rootkits, part 1 of 3: the growing threat

23. NMAP Homepage (1998) http://www.insecure.org/nmap/index.html

24. Bace R (2000) Intrusion detection. Macmillan Technical Publishing, Indianapolis

25. Baker A, Beale J, Caswell B, Poore M (2004) Snort 2.1 intrusion detection, 2nd edn. http://www.snort.org/

26. Roesch M (1999) Snort—lightweight intrusion detection for networks. In: Proceedings of the 13th LISA conference of USENIX association

27. Ali Aydin M, Halim Zaim A, Gökhan Ceylan K (2009) A hybrid intrusion detection system design for computer network security. Comput Electr Eng 35:517–526

28. Ding Y-X, Xiao M, Liu A-W (2009) Research and implementation on snort-based hybrid intrusion detection system. In: Proceedings of the eighth international conference on machine learning and cybernetics, Baoding, 12–15 July. IEEE Press, New York. doi:10.1109/ICMLC.2009.5212282

29. Gomez J, Gil C, Padilla N, Banos R, Jimenez C (2009) Design of a snort-based hybrid intrusion detection system. In: IWANN 2009, part II. LNCS, vol 5518, pp 515–522

30. Katkar V, Bhirud SG (2012) Novel DoS/DDoS attack detection and signature generation. Int J Comput Appl 47(10):18–24

31. Diaz-Verdejo JE, Garcia-Teodoro P, Munoz P, Macia-Fernandez G (2007) A Snort-based approach for the development and deployment of hybrid IDS. IEEE Lat Am Trans 5(6):386–392

32. Hwang K, Cai M, Chen Y, Qin M (2007) Hybrid intrusion detection with weighted signature generation over anomalous Internet episodes. IEEE Trans Dependable Secure Comput 4(1):41–55

33. Wuu LC, Hung CH, Chen SF (2007) Building intrusion pattern miner for Snort network intrusion detection system. J Syst Softw 80(10):1699–1715

34. Brahmi I, Yahia SB, Poncelet P (2011) A Snort-based mobile agent for a distributed intrusion detection system. In: SECRYPT 2011—proceedings of the international conference on security and cryptography, Seville, Spain, 18–21 July

35. Suryawanshi GR, Vanjale SB (2010) Mobile agent for distributed intrusion detection system in distributed system. In: Proceedings in international journal of artificial intelligence and computational research (IJAICR), June

36. Exposito JA, Ametller J, Robles S (2010) Configuring the JADE HTTP MTP

37. Zolnowski AP (2012) JADE-PKI 1.0 manual, 9 September

38. Linux Kernel Mailing List (2006) Fixes for uli5261 (tulip driver), Aug. http://lkml.org/lkml/2006/8/19/59

39. Kadav A, Renzelmann MJ, Swift MM (2009) Tolerating hardware device failures in software, SOSP’09, 11–14 October

40. Szmit M, Adamus S, Bugała S, Szmit A (2012) Anomaly detection 3.0 for Snort. Snort.AD Project

41. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: 5th annual ACM workshop on COLT. ACM Press, Pittsburgh, pp 144–152