Network-level polymorphic shellcode detection using …fabian/course_papers/polymorphic-detect.pdfof...

J Comput Virol (2007) 2:257–274DOI 10.1007/s11416-006-0031-z

ORIGINAL PAPER

Network-level polymorphic shellcode detection using emulation

Michalis Polychronakis · Kostas G. Anagnostakis ·Evangelos P. Markatos

Received: 31 July 2006 / Accepted: 1 November 2006 / Published online: 23 December 2006© Springer-Verlag France 2006

Abstract Significant progress has been made in recentyears towards preventing code injection attacks at thenetwork level. However, as state-of-the-art attack detec-tion technology becomes more prevalent, attackers arelikely to evolve, employing techniques such as poly-morphism and metamorphism to defeat these defenses.A major outstanding question in security research andengineering is thus whether we can proactively developthe tools needed to contain advanced polymorphic andmetamorphic attacks. While recent results have beenpromising, most of the existing proposals can be defeatedusing only minor enhancements to the attack vector.In fact, some publicly-available polymorphic shellcodeengines are currently one step ahead of the mostadvanced publicly-documented network-level detectors.In this paper, we present a heuristic detection methodthat scans network traffic streams for the presence ofpreviously unknown polymorphic shellcode. In contrastto previous work, our approach relies on a NIDS-embedded CPU emulator that executes every poten-tial instruction sequence in the inspected traffic, aimingto identify the execution behavior of polymorphic shell-code. Our analysis demonstrates that the proposed

M. Polychronakis (B) · E. P. MarkatosInstitute of Computer Science,Foundation for Research & Technology – Hellas,Heraklion, Crete, Greecee-mail: [email protected]

E. P. Markatose-mail: [email protected]

K. G. AnagnostakisInstitute for Infocomm Research,Singapore, Singaporee-mail: [email protected]

approach is more robust to obfuscation techniques likeself-modifications compared to previous proposals, butalso highlights advanced evasion techniques that needto be more closely examined towards a satisfactory solu-tion to the polymorphic shellcode detection problem.

1 Introduction

The primary aim of an attacker or an Internet worm is togain complete control over a target system. This is usu-ally achieved by exploiting a vulnerability in a servicerunning on the target system that allows the attacker todivert its flow of control and execute arbitrary code. Theexecution path of the vulnerable service can be divertedusing several exploitation methods, such as buffer over-flows, integer overflows, format string abuse, and arbi-trary data corruption. The code that is executed afterhijacking the instruction pointer is usually provided aspart of the attack vector. Although the typical actionof the injected code is to spawn a shell (hereby dubbedshellcode), the attacker can structure it to perform arbi-trary actions under the privileges of the service thatis being exploited [1]. For example, the “shellcode” ofrecent worms usually just connects back to the previousvictim, downloads the main body of the worm, and exe-cutes it. In this paper we use the term shellcode to referto malicious injected code with any purpose.

Significant progress has been made in recent yearstowards detecting previously unknown code injectionattacks at the network level [2–10]. However, as orga-nizations start deploying state-of-the-art detection tech-nology, attackers are likely to react by employingadvanced evasion techniques, such as polymorphism and

258 M. Polychronakis et al.

metamorphism, known from the virus scene since theearly 1990s [11], to defeat these defenses.

Polymorphic shellcode engines create different formsof the same initial shellcode by encrypting its body witha different random key each time, and by prependingto it a decryption routine that makes it self-decrypting.Since the decryptor itself cannot be encrypted, someintrusion detection systems rely on the identification ofthe decryption routine of polymorphic shellcodes. Whilenaive encryption engines produce constant decryptorcode, advanced polymorphic engines mutate the decryp-tor using metamorphism [12], which collectively refersto techniques such as dead-code insertion, code trans-position, register reassignment, and instruction substi-tution [13], making the decryption routine difficult tofingerprint.

A major outstanding question in security researchand engineering is thus whether we can proactivelydevelop mechanisms for automatic containment ofadvanced polymorphic attacks at the network-level.While results have been promising, and someapproaches can cope with limited polymorphism, whenpolymorphism and metamorphism is combined withadvanced evasion techniques like self-modifying code,as we demonstrate, most of the existing proposals canbe easily defeated.

In this paper, we revisit the question of whether poly-morphic shellcode is detectable at the network-level. Wepresent a detection heuristic that tests byte sequencesin network traffic for properties similar to polymor-phism. Specifically, we speculatively execute potentialinstruction sequences and compare their execution pro-file against the behavior observed to be inherent to poly-morphic shellcode. Our approach relies on a fully-blownIA-32 CPU emulator, which, in contrast to previouswork, makes the detector immune to runtime evasiontechniques such as self-modifying code.

The remainder of this paper is organized as follows.We summarize related work in Sect. 2, and discuss tech-niques that can be used for evading current network-level detectors in Sect. 3. We present in detail ouremulation-based detection method in Sect. 4, and exper-iments examining the performance of our approach inSect. 5. Finally, Sect. 6 discusses limitations and issuesthat need further research, and Sect. 7 concludes thepaper.

2 Related work

Network intrusion detection systems (NIDS) like Snort[14] and Bro [15] have been extensively used for shell-code detection. Although such systems usually detect

known attacks, for which a precise signature exists, theycan also be used for detecting previously unseen attacksusing generic signatures that match components com-mon to similar exploits, such as the NOP sled, pro-tocol framing, or specific parts of the shellcode [16].As a response to signature-based NIDS, attackers havestarted to employ encryption and polymorphism forevading detection [17–20].

Initial approaches on zero-day polymorphic shell-code detection focused on the identification of the sledcomponent [21,22] that is often prepended to the begin-ning of the shellcode. However, sleds are mostly use-ful in expediting exploit development, and in severalcases, especially in Windows exploits, can be completelyavoided through careful engineering using registersprings [23]. In fact, most infamous Internet worms sofar did not employ sleds. Our approach focuses on thedetection of the polymorphic shellcode itself, and thusworks even in the absence of a sled component. But-tercup [24] attempts to detect polymorphic buffer over-flow attacks by identifying the ranges of the possiblereturn addresses for existing buffer overflow vulnerabil-ities. Unfortunately, this heuristic cannot be employedagainst some of the more sophisticated buffer overflowattack techniques [25].

Several research efforts have focused on the auto-mated generation of signatures for previously unknownworms. These methods are based on the prevalence ofcommon byte sequences across different worminstances, among other characteristics, and derive sig-natures that match zero day worms by correlating pay-loads from different suspicious traffic flows [2,3,26].However, these approaches are prone to false positives,and are ineffective against polymorphic worms [27],which do not contain sufficiently long common bytesequences.

Polygraph [4], PAYL [6], PADS [5], and Hamsa [10],generate signatures that can capture polymorphic wormsby identifying common invariants among different worminstances, such as return addresses, protocol framing,and poor obfuscation. The derived signatures areexpressed as regular expressions or statistical byte distri-butions. Although above approaches can identify simpleobfuscated worms, their effectiveness is still question-able in the presence of extensive polymorphism [20].Furthermore, Polygraph and Hamsa rely on an first-levelclassifier that splits the traffic into two different pools forinnocuous and malicious samples, which introduces anadditional avenue for evasion [28].

A fundamental limitation of all above automated sig-nature generation methods is that they require multi-ple worm instances before reasoning for a threat, whichmakes them ineffective against targeted attacks. In

Network-level polymorphic shellcode detection using emulation 259

contrast, our proposed method identifies each attackseparately, which makes it also effective for targetedattacks.

Having identified the limitations of signature-basedapproaches, recent research efforts, most closely relatedto our work, have turned to static binary code anal-ysis for identifying exploit code inside network flows.Payer et al. [29] describe a hybrid polymorphic shellcodedetection engine based on a neural network that com-bines several heuristics, including a NOP-sled detectorand recursive traversal disassembly. However, the neu-ral network must be trained with both positive and nega-tive data in order to achieve a good detection rate, whichmakes it ineffective against zero-day attacks. Kruegelet al. [7] present a worm detection method that identi-fies structural similarities between different worm muta-tions. Styx [8] differentiates between benign data andprogram-like exploit code in network streams by lookingfor meaningful data and control flow, and blocks iden-tified attacks using automatically generated signatures.SigFree [9] detects the presence of attack code insidenetwork packets by pruning seemingly useless instruc-tions using data flow anomaly detection, and finding anincreased number of remaining useful instructions.

The main limitation of above static analysis basedapproaches, and main motivation for our work, is thatan attacker can effectively and effortlessly hinder staticanalysis, and thus evade detection. We discuss such eva-sion methods in the following section.

3 Static analysis resistant polymorphic shellcode

Several research efforts have turned to static binary codeanalysis for detecting previously unknown polymorphiccode injection attacks at the network level [7–9,21,22,29]. These approaches treat the input network stream aspotential machine code and analyze it for signs of mali-cious behavior. The first step of the analysis involves thedecoding of the binary machine instructions into theircorresponding assembly language representation, a pro-cess called disassembly. Some methods rely solely todisassembly for identifying long instruction chains thatmay denote the existence of a NOP sled [21,22] or shell-code [29]. After the code has been disassembled, sometechniques derive further control or data flow informa-tion that is then used for the discrimination betweenshellcode and benign data [7–9].

However, after the flow of control reaches the shell-code, the attacker has complete freedom to structureit in a complex way that can thwart attempts to stati-cally analyze it. In this section, we discuss ways in whichpolymorphic code can be obfuscated for evading

network-level detection methods based on static binarycode analysis.

Note that the techniques presented here are rathertrivial, compared to elaborate binary code obfuscationmethods [30–32], but powerful enough to illustrate thelimitations of detection methods based on static analy-sis. Advanced techniques for complicating static analy-sis have also been extensively used for tamper-resistantsoftware and for preventing the reverse engineering ofexecutables, as a defense against software piracy [33–35].

3.1 Thwarting disassembly

There are two main disassembly techniques: linear sweepand recursive traversal [36]. Linear sweep begins withthe first byte of the stream and decodes each instruc-tion sequentially, until it encounters an invalid opcodeor reaches the end of the stream. The main advantage oflinear sweep is its simplicity, which makes it very light-weight, and thus an attractive solution for high-speednetwork-level detectors.

Since the IA-32 instruction set is very dense, with248 out of the 256 possible byte values representing alegitimate starting byte for an instruction, disassemblingrandom data is likely to give long instruction sequencesof seemingly legitimate code [37]. The main drawbackof linear sweep is that it cannot distinguish betweencode and data embedded in the instruction stream, andincorrectly interprets them as valid instructions [38]. Anattacker can exploit this weakness and evade detectionmethods based on linear sweep disassembly using well-known anti-disassembly techniques. The injected codecan be obfuscated by interspersing junk data among theexploit code, not reachable at runtime, with the pur-pose to confuse the disassembler. Other common anti-disassembly techniques include overlapping instructionsand jumping into the middle of instructions [39].

The recursive traversal algorithm overcomes someof the limitations of linear sweep by taking into accountthe control flow behavior of the program. Recursive tra-versal operates in a similar fashion to linear sweep, butwhenever a control transfer instruction is encountered,it determines all the potential target addresses and pro-ceeds with disassembly at those addresses recursively.For instance, in case of a conditional branch, it con-siders both the branch target and the instruction thatimmediately follows the jump. In this way, it can “jumparound” data embedded in the instruction stream whichare never reached during execution.

Figure 1 shows the disassembly of the decoder partof a shellcode encrypted using the Countdown encryp-tion engine of the Metasploit Framework [40] using lin-ear sweep and recursive traversal. The code has been


Fig. 1 Disassembly of thedecoder produced by theCountdown shellcodeencryption engine using(a) linear sweep and(b) recursive traversal

(a) (b)

mapped to address 0x0000 for presentation purposes.The target of the call instruction at address 0x0003lies at address0x0007, one byte before the end ofcall,i.e., thecall instruction jumps to itself. This tricks lineardisassembly to interpret the instructions immediatelyfollowing the call instruction incorrectly. In contrast,recursive traversal follows the branch target and disas-sembles the overlapping instructions correctly.

However, the targets of control transfer instructionsare not always identifiable. Indirect branch instructionstransfer control to the address contained in a registeroperand and their destination cannot be statically deter-mined. In such cases, recursive traversal also does notprovide an accurate disassembly, and thus, an attackercould use indirect branches extensively to hinder it.Although some advanced static analysis methods canheuristically recover the targets of indirect branches,e.g., when used in jump tables, they are effective onlywith compiled code and well-structured binaries [36,38,41,42]. A motivated attacker can construct highlyobfuscated code that abuses any assumptions about thestructure of the code, including the extensive use of indi-rect branch instructions, which impedes both disassem-bly methods.

3.2 Thwarting control and data flow analysis

Once the code has been disassembled, some approachesanalyze the code further using control flow analysis,by extracting its control flow graph (CFG). The CFGconsists of basic blocks as nodes, and potential controltransfers between blocks as edges. Kruegel et al. [7] usethe CFG of several instances of a polymorphic wormto detect structural similarities between different muta-tions. Chinchani et al. [8] differentiate between data andexploit code in network streams based on the controlflow of the extracted code.

SigFree, proposed by Wang et al. [9], uses both con-trol and data flow analysis to discriminate between codeand data. Data flow analysis examines the data oper-ands of instructions and tracks the operations that areperformed on them within a certain code block. Afterthe extraction of the control flow graph, SigFree uses

Fig. 2 A modified, static analysis resistant version of the Count-down decoder

data flow analysis techniques to prune seemingly uselessinstructions, aiming to identify an increased number ofremaining useful instructions that denote the presenceof code.

However, even if a precise approximation of the CFGcan be derived in the presence of indirect jumps or otheranti-disassembly tricks, a motivated attacker can stillhide the real CFG of the shellcode using self-modifyingcode, a much more powerful technique. Self-modifyingcode modifies its own instructions dynamically at run-time. Although payload encryption is also a form of self-modification, in this section we consider modificationsto the decoder code itself, which is the only shellcodepart exposed to static binary code analysis.

Since self-modifying code can transform almost anyinstruction of itself to a different instruction, an attackercan construct a decryptor that will eventually executeinstructions that do not appear in the initial code image,on which static analysis methods operate on. Thus, cru-cial control transfer or data manipulation instructionscan be concealed behind fake instructions, specificallyselected to hinder control and data flow analysis. Thereal instructions will be written into the shellcode’s mem-ory image while it is executing, and thus are inaccessibleto static binary code analysis methods.

A very simple example of this technique, also knownas “patching,” is presented in Fig. 2, which shows therecursive traversal disassembly of a modified version ofthe Countdown decoder presented in Fig. 1. There aretwo main differences: anadd instruction has been addedat address 0x000A, and the loop 0xA instruction hasbeen replaced by add bh,dl. At first sight, this code


Fig. 3 Execution trace of the modified Countdown decoder

(a) (b)

Fig. 4 Control flow graph of the modified Countdown decoder(a) based on the code derived using recursive traversal disassem-bly, and (b) based on its actual execution

does not look like a polymorphic decryptor, since theflow of control is linear, without any backward jumpsthat would form a decryption loop. However, in spite ofthe intuition we get by statically analyzing the code, thecode is indeed a polymorphic decryptor which decryptsthe encrypted payload correctly, as shown by the execu-tion trace of Fig. 3.

The decoder starts by initializing ecx with the value0x7F, which corresponds to the encoded payload sizeminus one. The call instruction sets the instructionpointer to the relative offset -1, i.e., the inc ecxinstruction at address0x0007.Pop then loads the returnaddress that was pushed in the stack by call in ecx.These instructions are used to find the absolute addressfrom which the decoder is executing, as discussed inSect. 4.1.2.

The crucial point is the execution of the add[esi+0xA],0xE0 instruction. The effective addressof the left operand corresponds to address 0x0012, soadd will modify its contents. Initially, at this address isstored the instruction add bh,dl. By adding the value0xE0 to this memory location, the code at this locationis modified and add bh,dl is transformed to loop0xe. Thus, while the decryptor is executing, as soon asthe instruction pointer reaches the address 0x0012, theinstruction that will actually be executed is loop 0xe.

Even in this simple form, the above technique is veryeffective in obfuscating the real CFG of the shellcode.Indeed, as shown in Fig. 4, a slight self-modificationof just one instruction results to significant differencesbetween the CFG derived using static analysis, and theactual CFG of the code that is eventually executed.If such self-modifications are applied extensively, thenthe real CFG can effectively be completely concealed.Going one step further, an attacker could implement apolymorphic engine that produces decryptors with arbi-trarily fake CFGs, different in each shellcode instance,for evading detection methods based on CFG analysis.This can be easily achieved by placing fake control trans-fer instructions which during execution are overwrittenwith other useful instructions. Instructions that manipu-late crucial data can also be concealed in the same man-ner in order to hinder data flow analysis. Static binarycode analysis would need to be able to compute theoutput of each instruction in order to extract the realcontrol and data flow of the code that will be eventuallyexecuted.

4 Network-level execution

Carefully crafted polymorphic shellcode can evadedetection methods based on static binary code analy-sis. Using anti-disassembly techniques, indirect controltransfer instructions, and most importantly, self-modifications, static analysis resistant polymorphic shell-code will not reveal its actual form until it is eventuallyexecuted on a real CPU. This observation motivated usto explore whether it is possible to detect such highlyobfuscated shellcode by actually executing it, using onlyinformation available at the network level.

4.1 Approach

Our goal is to detect network streams that contain poly-morphic exploit code by passively monitoring the incom-ing network traffic. Each request to some networkservice hosted in the protected network is treated asa potential attack vector. The detector attempts to exe-cute each incoming request in a virtual environment asif it was executable code. Depending on the executionbehavior, we can differentiate between benign data andpolymorphic shellcode. Besides the NOP sled, whichmight not exist at all [23], the only executable part ofpolymorphic shellcodes is the decryption routine. There-fore, the detection algorithm focuses on the identifica-tion of the decryption process that takes place duringthe initial execution steps of a polymorphic shellcode.


Being isolated from the vulnerable host, the detectorlacks the context in which the injected code would run.Crucial information such as the OS of the host and theprocess being exploited might not be known in advance.At first sight, under these extremely obscure conditions,it does not seem feasible to fully simulate the executionof arbitrary shellcode by relying only to the capturedattack vector. For instance, knowing the OS type andversion is crucial for emulating system calls.

In this work, we focus on the detection of polymor-phic shellcodes. The execution of a polymorphic shell-code can be conceptually split in two sequential parts:the execution of the decryptor, and the execution of theactual payload. The accurate execution of the payload,which usually includes several advanced operations suchas the creation of sockets or files, would require a com-plete virtual machine environment, including an appro-priate OS. In contrast, the decryptor is in essence a seriesof machine instructions that perform a certain compu-tation over the memory locations where the encryptedshellcode has been injected. This allows us to simulatethe execution of the decryptor using merely a CPU emu-lator. The only requirement is that the emulator shouldbe compatible with the hardware architecture of the vul-nerable host. For our prototype, we have focused on theIA-32 architecture.

Up to this point, the context of the vulnerable processin which the shellcode would be injected is still missing.Specifically, since the emulator has no access to the tar-get host, it lacks the memory and CPU state of the vul-nerable process at the time its flow of control is divertedto the injected code. However, the construction of poly-morphic shellcodes conforms to several restrictions thatallow us to simulate the execution of the decryptor part,even without having any further information about thecontext in which it is destined to run. In the remainderof this section we discuss these restrictions.

4.1.1 Position-independent code

In a dynamically changing stack or heap, the exact mem-ory location where the shellcode will be placed is notknown in advance. For this reason, any absolute address-ing is avoided and reliable shellcode is made completelyrelocatable, in order to run from any memory posi-tion. Otherwise, the exploit becomes fragile [1]. Forinstance, in case of Linux stack-based buffer overflows,the absolute address of the vulnerable buffer variesbetween systems, even for the same compiled execut-able, due to the different environment variables thatare stored in the beginning of the stack. This position-independent nature of shellcode allows us to map it in

an arbitrary memory location and start its executionfrom there.

4.1.2 GetPC code

Both the decryptor and the encrypted payload are partof the injected vector, with the decryptor stub usuallyprepended to the encrypted payload. Since the abso-lute memory address of the injected shellcode cannotbe accurately predicted in advance, the decoder needsto somehow find a reference to this exact memory loca-tion in order to decrypt the encrypted payload.

To this end, shellcodes take advantage of the CPUprogram counter (PC, or EIP in the IA-32 architecture).During the execution of the decryptor, the PC points tothe decryptor code, i.e., to an address within the memoryregion where the decryptor, along with the encryptedpayload, has been placed. However, the IA-32 architec-ture does not provide any EIP-relative memory address-ing mode,1 as opposed to instruction dispatch. Thus, thedecryptor cannot use the PC directly to reference to thememory locations of the encrypted payload in order tomodify it. Instead, the decryptor first loads the currentvalue of the PC to a register, and then uses this value tocompute the absolute address of the payload. The codethat is used for retrieving the current PC value is usuallyreferred to as the “getPC” code.

The simplest way to read the value of the PC isthrough the use of the call instruction. The intendeduse of call is for calling a procedure. When the callinstruction is executed, the CPU pushes the returnaddress in the stack, and jumps to the first instruction ofthe called procedure. The return address is the address ofthe instruction immediately following the call instruc-tion. Thus, the decryptor can compute the address ofthe encrypted payload by reading the return addressfrom the stack and adding to it the appropriate off-set in order to reference the payload memory loca-tions. This technique is used by the decryptor shown inFig. 1. The encrypted payload begins at address0x0010.Call pushes in the stack the address of the instruc-tion immediately following it (0x0008), which is thenpopped to esi. The size of the encrypted payload iscomputed in ecx, and the effective address computa-tion [esi+ecx+0x7] in xor corresponds to the lastbyte of the encrypted payload at address 0x08F. As thename of the engine implies, the decryption is performedbackwards, one byte at a time, starting from the lastencrypted byte.

1 The IA-64 architecture supports a RIP-relative data addressingmode. RIP stands for the 64bit instruction pointer.


Fig. 5 The decryptor of the PexFnstenvMov engine, which isbased on a getPC code that uses the fnstenv instruction

Finding the absolute memory address of the decryp-tor is also possible using the fstenv instruction, whichsaves the current FPU operating environment at thememory location specified by its operand [43]. Thestored record includes the instruction pointer of theFPU, which is different than EIP. However, if a float-ing point instruction has been executed as part of thedecryptor, then the FPU instruction pointer will alsopoint to the memory area of the decryptor, and thusfstenv can be used to retrieve its absolute memoryaddress. The same can also be achieved using one of thefstenv, fsave, or fnsave instructions.

Figure 5 shows the decoder generated by the PexFn-stenvMov engine of the Metasploit Framework [40],which uses an fnstenv-based getPC code. By spec-ifying the memory offset to the fstenv relative tothe stack pointer, the absolute memory address of thelatest floating point instruction fldz can be poppedto ebx. By combining the fstenv-based getPC codewith self-modifications, as presented in Sect. 3.2, it ispossible to construct a decoder with no control trans-fer instruction, i.e., with a CFG consisting of a singlenode.

A third getPC technique is possible by exploitingthe structured exception handling (SEH) mechanismof Windows [44]. However this technique works onlywith older versions of Windows, and the introduction ofregistered SEH in Windows XP and 2003 limits its appli-cability. From the tested polymorphic shellcode engines(cf. Sect. 5.2.1), only Alpha2 [45] supports this type ofgetPC, although not by default.

4.1.3 Known operand values

Polymorphic shellcode engines produce generic decryp-tor code for a specific hardware platform that runs inde-pendently of the OS version of the victim host or thevulnerability being exploited. The decoder is constructedwith no assumptions about the state of the process inwhich it will run, and any registers or memory locationsbeing used by the decoder are initialized on the fly. Thisallows us to correctly follow its execution from the very

first instruction, since instruction operands with initiallyunknown values will eventually become available.

For instance, the execution trace of the Countdowndecoder in Fig. 3 is always the same, independently of theprocess in which it has been injected. Indeed, the codeis self-contained, which allows us to correctly executeeven instructions with non-immediate operands whichotherwise would be unknown, as shown from the com-ments next to the code. The emulator can correctly ini-tialize the registers, follow stack operations, compute alleffective addresses, and even follow self modifications,since every operand eventually becomes known.

Note that, depending on the vulnerability, a skilledattacker may be able to construct a non-self-containeddecryptor, which our approach would not be able tofully execute. This could be possible by including in thecomputations of the decoder existing data that reside atknown locations of the memory image of the vulnera-ble process, and which remain consistent across all vul-nerable systems. Such data are not accessible from thenetwork-level emulator, and thus it cannot follow anyinstructions that manipulate them. We further discussthis issue in Sect. 6.

4.2 Detection algorithm

In this section we describe in detail the emulation-basedpolymorphic shellcode detection algorithm. The algo-rithm takes as input a byte stream captured passivelyfrom the network, such as a reassembled TCP streamor the payload of a UDP packet, and reasons whether itcontains polymorphic shellcode. Each input is executedon a CPU emulator as if it was executable code. Dueto the dense instruction set and the variable instruc-tion length of the IA-32 architecture, even non-attackstreams can be interpreted as valid executable code.However, such random code usually stops running soon,e.g., due to the execution of an illegal instruction, whilereal polymorphic code is being executed until theencrypted payload is fully decrypted.

The pseudo-code of the detection algorithm is pre-sented in Fig. 6 with several simplifications for brevity.Each input buffer is mapped to a random location in thevirtual address space of the emulator, as shown in Fig. 7.This corresponds to the placement of the attack vectorinto the input buffer of a vulnerable process. Before eachexecution attempt, the state of the virtual processor israndomized (line 5). Specifically, the EFLAGS register,which holds the flags of conditional instructions, and allgeneral purpose registers are assigned random values,except esp, which is set to point to the middle of thestack of a supposed process.


Fig. 6 Simplifiedpseudo-code for the detectionalgorithm

Payload Reads

Decryptor Encrypted Payload

Input Buffer: ~1-64KB

Virtual Address Space: 4GB

Fig. 7 Memory reads during the decryption of a polymorphicshellcode

4.2.1 Running the shellcode

The main routine, emulate, takes as parameters thestarting address and the length of the input stream.Depending on the vulnerability, the injected code maybe located at an arbitrary position within the stream.For example, the first bytes of a TCP stream or a UDPpacket payload will probably be occupied by protocoldata, depending on the application (e.g., the METHODfield in case of an HTTP request). Since the position ofthe shellcode is not known in advance, the main routineconsists of a loop which repeatedly starts the executionof the supposed code that begins from each and everyposition of the input buffer (line 3). We call a completeexecution starting from position i an execution chainfrom i.

Note that it is necessary to start the execution fromeach position i, instead of starting only from the firstbyte of the stream and relying on the self-synchronizingproperty of the IA-32 architecture [7,8], since we mayotherwise miss the execution of a crucial instruction thatinitializes some register or memory location. For exam-ple, going back to the execution trace of Fig. 3, if the exe-cution misses the first instruction push 0xF, e.g., dueto a misalignment or an overlapping instruction placedin purpose immediately before push, then the emulatorwill not execute the decryptor correctly, since the valueof the ecx register will be arbitrary. Furthermore, theexecution may stop even before reaching the shellcode,e.g., due to an illegal instruction.

For each position pos, the algorithm enters the mainexecution loop (line 6), in which a new instruction isfetched, decoded, the program counter is increased bythe length of the instruction, and finally the instructionis executed. In case of a control transfer instruction,upon its execution, the PC might have been changed tothe address of the target instruction. Since instructiondecoding is an expensive operation, decoded instruc-tions are stored in a translation cache (line 9). If aninstruction at a certain position of the buffer is goingto be executed again, e.g., as part of a loop body in thesame execution chain, or as part of a different executionchain of the same input buffer then the instruction isinstantly fetched from the translation cache.

4.2.2 Optimizing performance

For large input streams, starting a new execution fromeach and every position incurs a high execution over-head per stream. We have implemented the followingoptimization in order to mitigate this effect. The injectedshellcode is treated by the vulnerable application aslegitimate input data. Thus, it should conform to anyrestrictions that input data may have. Since usually theinjected code is treated by the vulnerable application asa string, and strings in C are terminated with a NULLbyte (a byte with zero value), any NULL byte withinthe shellcode will truncate it and render the code non-functional. For this reason, the shellcode cannot containNULL bytes inside its body.

We exploit this restriction by taking advantage of thezero bytes present in binary network traffic. Before start-ing execution from position i, a look-ahead scan is per-formed to find the first zero byte after position i. If a zerobyte is found at position j, and j − i is less than a mini-mum size S, then the positions from i to j are skipped andthe algorithm continues from position j+1. We have cho-sen a rather conservative value for S = 50, given that


most polymorphic shellcodes have a size greater than100 bytes.

In the rare case that a protected application acceptsNULL characters as part of the input data, this optimi-zation should be turned off. On the other hand, if theapplication protocol has other restricted bytes, which isquite common [40], extending the above optimization toconsider these bytes instead of the zero byte would dra-matically improve performance. For instance, the HTTPprotocol defines that the request header should be sepa-rated from the message body by a CRLF byte combina-tion. Since the two parts of an HTTP request are usuallytreated separately by web servers, we could extend theabove optimization to also consider CRLF byte combi-nations in case of HTTP traffic.

4.2.3 Detection heuristic

While the execution behavior of random code is unde-fined, there exists a generic execution pattern inherentto all polymorphic shellcodes, which allows us to accu-rately distinguish polymorphic code injection attacksfrom benign requests. Upon the hijack of the programcounter, the control flow of the vulnerable process isdiverted—sometimes through a NOP sled—to theinjected shellcode, and particular to the polymorphicdecryptor. During decryption, the decryptor reads thecontents of the memory locations where the encryptedpayload has been stored, decrypts them, and writes backthe decrypted data. Hence, the decryption process willresult in many memory accesses to the memory regionwhere the input buffer has been mapped to. Since thisregion is a very small part of the virtual address space, weexpect that memory reads from that area would occurrarely during the execution of random code.

Only instructions with a memory operand can poten-tially result in a memory read from the input buffer. Thismay happen if the absolute address that is specified bya direct memory operand, or if the computation of theeffective address of an indirect memory operand, cor-responds to an address within the input buffer. Inputstreams are mapped to a random memory location ofthe 4GB virtual address space. Additionally, before eachexecution, the CPU registers, some of which normallytake part in the computation of the effective address, arerandomized. Thus, the probability to encounter an acci-dental read from the memory area of the input buffer inrandom code is very low. In contrast, the decryptor willaccess tens or hundreds of different memory locationswithin the input buffer, as depicted in Fig. 7, dependingon the size of the encrypted payload and the decryptionfunction.

This observation led us to initially choose the numberof reads from distinct memory locations of the inputbuffer as the detection criterion. For the sake of brevity,we refer to memory reads from distinct locations of theinput buffer as “payload reads.” For a given executionchain, a number of payload reads greater than a certainpayload reads threshold (PRT) gives an indication forthe execution of a polymorphic shellcode.

We expected random code to exhibit a low payloadreads frequency, which would allow for a small PRTvalue, much lower than the typical number of payloadreads found in polymorphic shellcodes. Preliminaryexperiments with network traces showed that the fre-quency of payload reads in random code is very small,and usually only a few of the incoming streams had exe-cution chains with just one to ten payload reads. How-ever, there were rare cases with execution chains thatperformed hundreds of payload reads. This was usuallydue to the accidental formation of a loop with an instruc-tion that happened to read hundreds of different mem-ory locations from the input buffer. Since we expectedrandom code to exhibit a low number of payload reads,such behavior would have been flagged as polymorphicshellcode by our initial criterion, which would result infalse positives.

Since one of our primary goals is to have practicallyzero false positives, we addressed this issue by defin-ing a more strict criterion. As discussed in Sect. 4.1.2,a mandatory operation of every polymorphic shellcodeis to find its absolute memory address through the exe-cution of some form of getPC code. This led us to aug-ment the detection criterion as follows: if an executionchain of an input stream executes some form of getPCcode, followed by PRT or more payload reads, then thestream is flagged to contain polymorphic shellcode. Wediscuss in detail this criterion and its effectiveness interms of false positives in Sect. 5.1. The experimentalevaluation showed that the above heuristic allows foraccurate detection of polymorphic shellcode with zerofalse positives.

Another option for enhancing the detection heuristicwould be to look for linear payload reads from a con-tiguous region of the input buffer. However, this heu-ristic can be tricked by splitting the encrypted payloadinto nonadjacent parts which can then be decrypted inrandom order [46].

4.2.4 Ending execution

An execution chain may end for one of the follow-ing reasons: (i) an illegal or privileged instruction isencountered, (ii) the control is transferred to an invalid


or unknown memory location, or (iii) the number ofexecuted instructions has exceeded a certain threshold.

4.2.4.1 Invalid instruction The execution may stop ifan illegal or privileged instruction is encountered(line 10). Since privileged instructions can be invokedonly by the OS kernel, they cannot take part in the nor-mal shellcode execution. Although an attacker couldintersperse invalid or privileged instructions in theinjected code to hinder detection, these should comewith corresponding control transfer instructions thatwould bypass them during execution—otherwise theshellcode would not execute correctly. In that case, theemulator will also follow the real execution of the code,so such instructions will not cause any inconsistency. Atthe same time, privileged or illegal instructions appearrelatively often in random data, helping this way thedetector to distinguish between benign requests andattack vectors.

4.2.4.2 Invalid memory location Normally, duringthe execution of the decoder, the program counter willpoint to addresses of the memory region of the inputbuffer where the injected code resides. However, highlyobfuscated code could use the stack for storing someparts, or all of the decrypted code, or even for “pro-ducing” useful instructions on the fly, in a way similarto the self-modifications presented in Sect. 3.2. Thus,the flow of control may jump from the original codeof the decryptor to some generated instruction in thestack, then jump back to the input buffer, and so on.In fact, since the shellcode is the last piece of codethat will be executed as part of the vulnerable process,the attacker has the flexibility to write in any memorylocation mapped in the address space of the vulnerableprocess [47].

Although it is generally difficult to know in advancethe contents of a certain memory location, since theyusually vary between different systems, it is easier tofind virtual memory regions that are always mappedinto the address space of the vulnerable process. Forexample, if address space randomization is not appliedextensively, the attacker might know in advance somememory regions of the stack or heap that exist in everyinstance of the vulnerable process.

The emulator cannot execute instructions that readunknown memory locations because their contents arenot available to the network-level detector. Such instruc-tions are ignored and the execution continues normally.Otherwise, an attacker could trick the emulator by plac-ing NOP-like instructions that read arbitrary data frommemory locations known in advance to belong to theaddress space of the application. However, the emu-lator keeps track of any memory locations outside ofthe input buffer that are written during execution, and

marks them as valid memory locations where useful dataor code may have been placed. If at any time the pro-gram counter points to such an address, the executioncontinues normally from that location. In contrast, if thePC points to an address outside the input buffer that hasnot been written during the particular execution, thenthe execution stops (line 15). In random binary code,this usually happens when the PC reaches the end of theinput buffer.

Note that if an attacker knows in advance some mem-ory locations of the vulnerable process that contain codewhich can be used as part of the shellcode, then the emu-lator would not be able to fully execute it. We furtherdiscuss this issue in Sect. 6.

4.2.4.3 Execution threshold There are situations inwhich the execution of random code might not stopsoon, or even not at all, due to large code blocks with nobackward branches that are executed linearly, or due tothe occurrence of backwards jumps that form seemingly“endless” or infinite loops. In such cases, an executionthreshold (XT) is necessary for avoiding extensive per-formance degradation or execution hang ups (line 16).

An attacker could exploit this and evade detection byplacing a loop before the decryptor which would executeenough instructions to exceed the execution thresholdbefore the code of the actual decryptor is reached. Wecannot simply skip such loops, since the loop body couldperform a crucial computation for the further correctexecution of the decoder, e.g., computing the decryp-tion key. Fortunately, endless loops occur with low fre-quency in normal traffic, as discussed in Sect. 5.3. Thus,an increase in input requests with execution chains thatreach the execution threshold due to a loop might bean indication of a new attack outbreak using the aboveevasion method.

4.2.5 Infinite loop squashing

To further mitigate the effect of seemingly endless loops,we have implemented a heuristic for identifying andstopping the execution of provably infinite loops thatmay occur in random code. Loops are detected dynam-ically using the method proposed by Tubella et al.[48].This technique detects the beginning and the termina-tion of iterations and loop executions in run-time usinga Current Loop Stack that contains all loops that arebeing executed at a given time.

The following infinite loop cases are detected: (i)there is an unconditional backward branch from addressS to address T, and there is no control transfer instruc-tion in the range [T,S] (the loop body), and (ii) thereis a conditional backward branch from address S toaddress T, none of the instructions in the range [T,S] is a


(a) (b)

Fig. 8 Infinite loops in random code due to (a) unconditional and(b) conditional branches

control transfer instruction, and none of the instruc-tions in the range [T,S] affects the status flag(s) of theEFLAGS register on which the conditional branchdepends on.

Examples of the two infinite loop cases are presentedin Fig. 8. In example (b), when control reaches the rorinstruction at address 0x0F30, the parity flag (PF) isalready set as a result of some previous instruction. Roraffects only the CF and OF flags, stc affects only the CFflag, which it sets to 1, and mov and fnop do not affectany flags. Since none of the instructions in the loop bodyaffects the PF, its value will not change until the jump-if-parity instruction is executed, which will jump back tothe ror instruction, resulting to an infinite loop.

Clearly, these are very simple cases, and more com-plex infinite loop structures may arise. Our experimentshave shown that, depending on the monitored traffic,the above heuristics prune about 3–6% of the executionchains that stop due to reaching the execution thresh-old. Loops in random code are usually not infinite, butseemingly “endless,” being executed for a very largenumber of iterations until completion. Thus, the runtimeoverhead of any more elaborate infinite loop detectionmethod will be higher than the overhead of simply run-ning the extra infinite loops that may arise until theyreach the execution threshold.

4.3 Implementation

In this section we describe the prototype implementa-tion of our network-level detector. The detector pas-sively captures network packets usinglibpcap [49] andreassembles TCP/IP streams using libnids [50]. Theinput buffer size is set to 64KB, which is large enough fortypical service requests. Especially for web traffic, pipe-lined HTTP/1.1 requests through persistent connectionsare split to separate streams. Otherwise, an attackercould evade detection by filling the stream with benignrequests until exceeding the buffer size.

Instruction set simulation has been implementedinterpretively, with a typical fetch, decode, and executecycle. Accurate instruction decoding, which is crucialfor the identification of invalid instructions, is performedusing libdasm [51]. For our prototype, we have

implemented a subset of the IA-32 instruction set,including most of the general-purpose instructions, butno FPU, MMX, SSE, or SSE2 instructions, exceptfstenv/fnstenv, fsave/fnsave, and rdtsc.However, all instructions are fully decoded, and if duringexecution an unimplemented instruction is encountered,the emulator proceeds normally to the next instruction.

The implemented subset suffices for the complete andcorrect execution of the decryption part of all the testedshellcodes (cf. Sect. 5.2.1). Even the highly obfuscatedshellcodes generated by the TAPiON engine [20], whichintersperses FPU instructions among the decoder code,are executed correctly, since the FPU instructions areused only as NOPs and do not take part in the usefulcomputations of the decoder.

5 Experimental evaluation

In this section we evaluate the performance of the pro-posed approach using our prototype implementation.In all experiments, the detector was running on a PCequipped with a 2.53 GHz Pentium 4 processor and 1 GBRAM, running Debian Linux (kernel v2.6.7). For trace-driven experiments, we used full packet traces of trafficfrom ports related to the most exploited vulnerabilities,captured at ICS-FORTH and the University of Crete.Trace details are summarized in Table 1. Since remotecode-injection attacks are performed using a speciallycrafted request to a vulnerable service, we keep onlythe client-to-server traffic of network flows. For largeincoming TCP streams, e.g., due to a file upload, wekeep only the first 64KB. Note that these traces rep-resent a significantly smaller portion of the total traf-fic that passed by through the monitored links duringthe monitoring period, since we keep only the client-initiated traffic.

5.1 Tuning the detection heuristic

The major drawback of anomaly-based or heuristics-based attack detection methods is their relatively highfalse positive ratio. Such methods should have negligible

Table 1 Characteristics of client-to-server network traffic traces

Service Port Number Number Total sizeof streams

www 80 1759950 1.72 GBNetBIOS 137–139 246888 311 MBmicrosoft-ds 445 663064 912 MB


false positive ratio in order to be useful. Since theproposed approach is based on a heuristic detectionmethod, we first assess the possibility that the detectionalgorithm incorrectly detects benign data as polymor-phic shellcode.

As discussed in Sect. 4.2.3, the detection criterionrequires the execution of some form of getPC code,followed by a number of payload reads that exceeda certain threshold. Our initial implementation of thisheuristic was the following: if an execution chain con-tains a call, fstenv, fnstenv, fsave, or fnsaveinstruction, followed by PRT or more payload reads,then it belongs to a polymorphic shellcode. There existfour different versions of the call instruction in theIA-32 instruction set. The existence of one of these eightinstructions serves just as an indication of the potentialexecution of getPC code. Only when combined with theexecution of several payload reads, it gives a good indi-cation of the execution of a polymorphic shellcode.

5.1.1 Evaluation with real traffic

We evaluated this heuristic using the client-to-serverrequests from the traces presented in Table 1 as inputto the detection algorithm. Only 13 out of the 2,669,902streams were found to contain an execution chain with acall or fstenv instruction followed by payload reads,and all of them had non-ASCII content. In the worstcase, there were five payload reads, allowing for a min-imum value for PRT = 6. However, since the false pos-itive rate is a crucial factor for the applicability of ourdetection method, we further explored the quality of thedetection heuristic using a significantly larger data set.

5.1.2 Evaluation with synthetic requests

We generated two million streams of varying sizes uni-formly distributed between 512 bytes and 64 KB withrandom binary content. From our experience, binarydata is much more likely to give false positives thanASCII only data. The total size of the data set was 61 GB.The results of the evaluation are presented in Table 2,under the column “Initial Heuristic.”

From the two million streams, 556 had an executionchain that contained a getPC instruction followed bypayload reads. Although 475 out of the 556 streams hadat most six payload reads, there were 44 streams withtens of payload reads, and 37 streams with more than 100payload reads, reaching 416 payload reads in the mostextreme case. As we show in Sect. 5.2.1, there are poly-morphic shellcodes that execute as few as 32 payloadreads. As a result, PRT cannot be set to a value greater

Table 2 Streams that matched the detection heuristic with a givennumber of payload reads

Payload Streams

Reads Initial Heuristic Improved Heuristic

# % # %

1 409 0.02045 22 0.001102 39 0.00195 5 0.000253 10 0.00050 3 0.000154 9 0.00045 1 0.000055 3 0.00015 1 0.000056 5 0.00025 1 0.000057–100 44 0.00220 0 0100–416 37 0.00185 0 0

than 32 since it would otherwise miss some polymorphicshellcodes. Thus, the above heuristic incorrectly identi-fies these cases as polymorphic shellcodes.

5.1.3 Defining a stricter detection heuristic

Although only the 0.00405 % of the total streams resultedto a false positive, we can devise an even more strict cri-terion to further lower the false positive rate.

Payload reads occur in random code whenever thememory operand of an instruction accidentally refers toa memory location within the input buffer. In contrast,the decoder of a polymorphic shellcode explicitly refersto the memory region of the encrypted payload basedon the value of the instruction pointer that is pushed inthe stack by a call instruction, or stored in the mem-ory location specified in an fstenv instruction. Thus,after the execution of a call or fstenv instruction,the next mandatory step of a getPC code is to (not nec-essarily immediately) read the instruction pointer fromthe memory location where it was stored.

This observation led us to further enhance the detec-tion criterion as follows: if an execution chain containsone of the eight different call, fstenv, or fsaveinstructions, followed by a read from the memory loca-tion where the instruction pointer was stored as a result ofone of the above instructions, followed by PRT or morepayload reads, then it belongs to a polymorphic shellcode.

Using the same data set, the enhanced heuristic resultsto significantly fewer matching streams, as shown inTable 2, under the column “Enhanced Heuristic.” Inthe worst case, one stream had an execution chain witha call instruction, an accidental read from the mem-ory location of the stack where the return address waspushed, and six payload reads. There were no streamswith more than six payload reads, which allows for alower bound for PRT = 7.


5.2 Validation

5.2.1 Polymorphic shellcode execution

We tested the capability of the emulator to correctlyexecute polymorphic shellcodes using real samples pro-duced by off-the-shelf polymorphic shellcode engines.We generated mutations of an 128 byte shellcode usingthe Clet [18], ADMmutate [17], and TAPiON [20]polymorphic shellcode engines, and the Alpha2 [45],Countdown, JmpCallAdditive, Pex, PexFnstenvMov,PexFnstenvSub, and ShigataGaNai shellcode encryp-tion engines from the Metasploit Framework [40].

TAPiON, the most recent of the engines, produceshighly obfuscated code using anti-disassembly and anti-emulator techniques, many garbage instructions, codeblock transpositions, and on-the-fly instruction genera-tion. In several cases, the decryptor produces on-the-flysome code in the stack, jumps to it, and then jumps backto the original decryptor code.

For each engine, we generated 1000 instances of theoriginal shellcode. For engines that support optionsrelated to the obfuscation degree, we split the 1000 sam-ples evenly using all possible parameter combinations.The execution of each sample stops when the completeoriginal shellcode is found in the memory image of theemulator.

Figure 9 shows the average number of executedinstructions that are required for the complete decryp-tion of the payload for the 1000 samples of each engine.The ends of range bars, where applicable, correspondto the samples with the minimum and maximum num-ber of executed instructions. In all cases, the emula-tor decrypts the original shellcode correctly. Figure 10shows the average number of payload reads for thesame experiment. For simple encryption engines, thedecoder decrypts four bytes at a time, resulting to 32payload reads. ADMmutate decoders read either oneor four bytes at a time. On the other extreme, shell-codes produced by the Alpha2 engine perform morethan 500 payload reads. Alpha2 produces alphanumericshellcode using a considerably smaller subset of theIA-32 instruction set, which forces it to execute muchmore instructions in order to achieve the same goals.

Given that 128 bytes is a rather small size for a func-tional payload, these results can be used to derive anindicative upper bound for PRT = 32 (a higher valuewould miss such small shellcodes). Combined with theresults of the previous section, which showed that theenhanced heuristic is very resilient to accidental pay-load reads, this allows for a range of possible valuesfor PRT from 7 to 31. For our experiments we choosefor PRT the median value of 19, which allows for even

ADMmutateClet

Alpha2Countdown

JmpCallAdditivePex

PexFnstenvMovPexFnstenvSub

ShikataGaNaiTAPiON

Executed instructions

32 64 128 256 512 1024 2048 4096 8192

Fig. 9 Average number of executed instructions for the completedecryption of an 128 byte shellcode encrypted using different poly-morphic and encryption engines

ADMmutateClet

Alpha2Countdown

JmpCallAdditivePex

PexFnstenvMovPexFnstenvSub

ShikataGaNaiTAPiON

Payload reads

8 16 32 64 128 256 512

Fig. 10 Average number of payload reads for the completedecryption of an 128 byte shellcode encrypted using different poly-morphic and encryption engines

more extreme cases of accidental payload reads not to bemisclassified as true positives, while at the same time cancapture even smaller shellcodes.

5.2.2 Detection effectiveness

To test the efficacy of our detection method, we launcheda series of remote code-injection attacks using theMetasploit Framework [40] against an unpatched Win-dows XP host running Apache v1.3.22. Attacks werelaunched from a Linux host using Metasploit’s exploitsfor the following vulnerabilities: Apache win32 chunkedencoding [52], Microsoft RPC DCOM MS03-026 [53],and Microsoft LSASS MS04-011 [54]. The detector wasrunning on a third host that passively monitored theincoming traffic of the victim host. For the exploit pay-load we used the shellcode win32_reverse, whichconnects back to the attacking host and spawns a shell,encrypted using different engines. We tested all com-binations of the three exploits with the engines pre-sented in the previous section. All attacks were detectedsuccessfully, with zero false negatives.

5.3 Processing cost

In this section we evaluate the raw processing speed ofour prototype implementation using the network traces


Execution threshold

256 512 1024 2048 4096 8192 16384 32768

Thr

ough

put (

Mbi

t/s)

0

40

80

120

160 port 139port 445port 80

Fig. 11 Processing speed for different execution thresholds

Execution threshold256 512 1024 2048 4096 8192 16384 32768

Str

eam

s re

ache

d th

resh

old

(%)

0

2

4

6

8

10

12

14port 139port 445port 80

Fig. 12 Percentage of streams that reach the execution threshold

presented in Table 1. Although emulation is a CPU-intensive operation, our aim is to show that it is feasibleto apply it for network-level polymorphic attack detec-tion. One of the main factors that affect the processingspeed of the emulator is the execution threshold beyondwhich an execution chain stops. The larger the XT, themore the processing time spent on streams with longexecution chains. As shown in Fig. 11, as XT increases,the throughput decreases, especially for ports 139 and445. The reason for the linear decrease of the through-put for these ports is that some streams have very longexecution chains that always reach the XT, even whenit is set to large values. For higher execution thresholds,the emulator spends even more cycles on these chains,which decreases the overall throughput.

We further explore this effect in Fig. 12, which showsthe percentage of streams with an execution chain thatreaches a given execution threshold. As XT increases,the number of streams that reach it decreases. This effectoccurs only for low XT values due to large code blockswith no branch instructions that are executed linearly.For example, the execution of linear code blocks withmore than 256 but less than 512 valid instructions is

terminated before reaching the end when using athreshold of 256, but completes correctly with a thresh-old of 512. However, the occurrence probability of suchblocks is reversely proportional to their length, due tothe illegal or privileged instructions that accidentallyoccur in random code. Thus, the percentage of streamsthat reach the execution threshold stabilizes beyond thevalue of 2048. After this value, XT is reached solely dueto execution chains with “endless” loops, which usuallyrequire a prohibitive number of instructions in order tocomplete.

In contrast, port 80 traffic behaves differently becausethe ASCII data that dominate in web requests producemainly forward jumps, making the occurrence of end-less loops extremely rare. Therefore, beyond an XT of2048, the percentage of streams with an execution chainthat stops due to reaching the execution threshold isnegligible, reaching 0.12 %. However, since ASCII webrequests do not contain any null bytes, the zero-delim-ited chunks optimization does not reduce the numberof execution chains per stream, which results to a lowerprocessing speed.

We should stress at this point that these results referto the raw processing speed of the detector, which meansthat under normal operation will be able to inspect traf-fic of higher speeds, since usually the incoming trafficto some service is less compared to the outgoing traf-fic. For example, the outgoing traffic from typical webservers is much more than the incoming traffic, becauseusually the content of web pages is larger than the sizeof incoming requests. Indeed, a study of the web servertraffic at FORTH and the University of Crete for oneweek showed that from the total traffic, 1.5 % and 14 %was incoming traffic, and 98.5 % and 86 % was outgoingtraffic, respectively.

Figures 11 and 12 represent two conflicting trade-offs related to the execution threshold. Presumably, thehigher the processing speed, the better, which leadstowards lower XT values. On the other hand, as dis-cussed in Sect. 4.2.4.3, it is desirable to have as fewstreams with execution chains that reach the XT as possi-ble. This leads towards higher XT values, which increasethe visibility of endless loop attacks. Regarding this sec-ond requirement, XT values higher than 2048 do notoffer any improvement to the percentage of streamsthat reach it. After an XT of 2048, the percentage ofstreams that reach it stabilizes at 2.65% for port 139 and4.08% for port 445.

At the same time, an XT of 2048 allows for a quitedecent processing speed, especially when taking intoaccount that live incoming traffic will usually have rel-atively lower volume than the monitored link’s band-width, especially if the protected services are not related


Execution threshold

0 2000 4000 6000 8000 10000 12000 14000

Pay

load

rea

ds

0

100

200

300

400

500

600

700

Alpha2TAPiONADMmutateClet

0 100 200 300 400 500 6000

10

20

30

40

Fig. 13 The average number of payload reads of Fig. 10 that agiven execution threshold allows to be executed. All decryptorsperform approximately 20 payload reads within the first 300 exe-cuted instructions

to file uploads. We should also stress at this point thatour prototype is highly unoptimized. For instance, anemulator implemented using threaded code [55], com-bined with optimizations such as lazy condition codeevaluation [56], would result to better performance.

A final issue that we should take into account is toensure that the selected execution threshold allows poly-morphic shellcodes to perform enough payload readsto reach the payload reads threshold and be success-fully detected. As shown in Sect. 5.2.1, the completedecryption of some shellcodes requires the executionof even more than 10000 instructions, which is muchhigher than an XT as low as 2048. However, as shown inFig. 13, even lower XT values, which give better through-put for binary traffic, allow for the execution of morethan enough payload reads. For example, in all cases,the chosen PRT value of 19 is reached by executing only300 instructions.

6 Limitations

6.1 Non-self-modifying shellcode

A fundamental limitation of our method is that it detectsonly polymorphic shellcodes that decrypt their bodybefore executing their actual payload. Plain or com-pletely metamorphic shellcodes that do not performany self-modifications are not captured by our detec-tion heuristic. However, we have yet to see a purelymetamorphic shellcode engine implementation, whilepolymorphic engines are becoming more prevalent andcomplex [20], mainly for the following two reasons.

First, polymorphic shellcode is increasingly used forevading intrusion detection systems. Second, the ever

increasing functionality of recent shellcodes makes theirconstruction more complex, while at the same time theircode should not contain NULL and, depending on theexploit, other restricted bytes, such as CR, LF, SP, VT,and others. Thus, it is easier for shellcode authors toavoid such bytes in the code by encoding its body usingan off-the-shelf encryption engine, rather than havingto handcraft the shellcode [1]. In many cases the latteris non-trivial, since many exploits require the avoidanceof many restricted bytes [40]. There are also cases whereeven more strict constraints should be handled, such asthat the shellcode should survive processing from func-tions like toupper(), or that it should be composedonly by printable ASCII characters [19,45].

6.2 Non-self-contained shellcode

Our method works only with self-contained shellcode.Although current polymorphic shellcode engines pro-duce self-contained code, a motivated attacker couldevade network-level emulation by constructing a non-self-contained shellcode that involves registers or mem-ory locations with a priori known values that remainconstant across all vulnerable systems. For example, ifit is known in advance that the address 0x40038EF0in the vulnerable process’ address space contains theinstruction ret, then the shellcode can be obfuscatedby inserting the instruction call 0x40038EF0 at anarbitrary position in the decoder code. Although thiswill have no effect to the actual execution of the shell-code, since the flow of control will simply be transferredto address 0x40038EF0, and from there immediatelyback to the decoder code, due to the ret instruction,the network-level emulator will not execute it correctly,since it cannot follow the jump to address0x40038EF0.

However, the extended use of hardcoded addressesresults in more fragile code [1], as they tend to changeacross different software and OS versions, especiallyas address space randomization schemes are becom-ing more prevalent [57]. In our future work, we planto explore ways to augment the network-level detectorwith host-level information, such as the invariant partsof the address space of the protected processes, in orderto make it more robust to such obfuscations.

6.3 Endless loops

Another possible evasion method is the placement ofendless loops for reaching the execution thresholdbefore the actual decryptor code runs. Although thisis a well-known problem in the context of virus scan-ners for years, if attackers start to employ such evasiontechniques, our method will still be useful as a first-stage


anomaly detector for application-aware NIDS likeshadow honeypots [58], given that the appearance ofendless loops in random code is rare, as shown inSect. 5.3.

6.4 Transformations beyond the transport layer

Shellcodes contained in compressed HTTP/1.1 connec-tions, or unicode-proof shellcodes [47], which becomefunctional after being transformed according to the uni-code encoding by the attacked service, are not executedcorrectly by our prototype. This is an orthogonal issuethat can be addressed by reversing the encoding used ineach case by the protected service through appropriatefilters before the emulation stage.

Generally, network data that are being transformedabove the transport layer, before reaching the core appli-cation code, cannot always be effectively inspected usingpassive network monitoring, as for example in case ofencrypted SSL or HTTPS connections. In such cases,our technique can still be applied by moving it from thenetwork-level to a proxy that first decrypts the trafficbefore scanning it [59]. Another option is to integratethe detector to the end hosts, either at the socket level, byintercepting calls that read network input trough libraryinterposition [60], or at the application level as an exten-sion to the protected service, e.g., as module for theApache web server [9].

7 Conclusion

We have considered the problem of detecting previouslyunknown polymorphic code injection attacks at the net-work level. The main question is whether highly obfus-cated versions of such attacks can be identified purelybased on the limited information available through pas-sive network traffic monitoring.

The starting point for our work is the observationthat previous proposals that rely on static analysis areinsufficient, because they can be bypassed using tech-niques such as simple self-modifications. In responseto this observation, we explore the feasibility of per-forming more accurate analysis through network-levelexecution of potential shellcodes, by employing a fully-blown processor emulator on the NIDS side. We haveexamined the execution profiles of a large number ofshellcodes produced using various polymorphic shell-code engines, and identified properties that can distin-guish polymorphic shellcodes from normal traffic withreasonable accuracy. Our analysis indicates that ourapproach can detect all known classes of polymorphicshellcodes, including those that employ certain forms of

self-modifications that are not detected by previous pro-posals. Furthermore, our experiments suggest that thecost of our approach is modest.

However, further analysis on the robustness of ourapproach also revealed that attackers can succeed incircumventing our techniques if the shellcode is notself-contained. In particular, the attacker can leveragecontext not available at the network level for buildingshellcodes that cannot be unambiguously executed onthe network level processor emulator. Detecting suchattacks remains an open problem.

One way of tackling this problem is to feed the nec-essary host-level information to the NIDS, as suggestedin [61], but the feasibility of doing so is yet to be proven.A major concern is that, in most cases, bypassing shell-code detection techniques, including our own, has beenrelatively straightforward, and appears to carry no addi-tional cost or risks for the attacker. Thus, these tech-niques do not necessarily “raise the bar” for the attacker,while their cost for the defender in terms of the resourcesthat need to be devoted to detection can be significant.At this point, it remains unclear whether accurate net-work level detection is feasible. Nevertheless, we believethat the work described in this paper brings us one stepcloser to answering this question.

Acknowledgment This work was supported in part by the projectsCyberScope, EAR, and Miltiades, funded by the Greek GeneralSecretariat for Research and Technology under contract numbersPENED 03ED440, USA-022, 05NON-EU-109, respectively, andby the FP6 project NoAH funded by the European Union undercontract number 011923. Michalis Polychronakis and EvangelosP. Markatos are also with the University of Crete.

References

1. sk, History and advances in windows shellcode. Phrack11(62), (2004)

2. Kim, H.-A., Karp, B.: Autograph: toward automated, distrib-uted worm signature detection. In: Proceedings of the 13thUSENIX Security Symposium, pp. 271–286, (2004)

3. Singh, S., Estan, C., Varghese, G., Savage, S.: Automatedworm fingerprinting. In: Proceedings of the 6th Symposiumon Operating Systems Design & Implementation (OSDI),(2004)

4. Newsome, J., Karp, B., Song, D.: Polygraph: automaticallyGenerating signatures for polymorphic worms. In: Proceed-ings of the IEEE Security & Privacy Symposium, pp. 226–241,(2005)

5. Tang, Y., Chen, S.: Defending against internet worms: asignature-based approach. In: Proceedings of the 24th AnnualJoint Conference of IEEE Computer and Communicationsocieties (INFOCOM), (2005)

6. Wang, K., Stolfo, S.J.: Anomalous payload-based networkintrusion detection. In: Proceedings of the 7th InternationalSymposium on Recent Advanced in Intrusion Detection(RAID), pp. 201–222, (2004)


7. Kruegel, C., Kirda, E., Mutz, D., Robertson, W., Vigna, G.:Polymorphic worm detection using structural information ofexecutables. In: Proceedings of the International Symposiumon Recent Advances in Intrusion Detection (RAID), (2005)

8. Chinchani, R., Berg, E.V.D.: A fast static analysis approach todetect exploit code inside network flows. In: Proceedings ofthe International Symposium on Recent Advances in Intru-sion Detection (RAID), (2005)

9. Wang, X., Pan, C.-C., Liu, P., Zhu, S.: Sigfree: a signature-free buffer overflow attack blocker. In: Proceedings of theUSENIX Security Symposium (2006)

10. Li, Z., Sanghi, M., Chen, Y., Kao, M.-Y., Chavez, B.:Hamsa: fast signature generation for zero-day polymorphicworms with provable attack resilience. In: Proceedings of the2006 IEEE Symposium on Security and Privacy, pp. 32–47,2006

11. Ször, P.: The art of computer virus research and defense.Addison-Wesley Professional, (2005)

12. Ször, P., Ferrie, P.: Hunting for metamorphic. In: Proceedingsof the virus bulletin conference. pp. 123–144, (2001)

13. Christodorescu, M., Jha, S.: Static analysis of executables todetect malicious patterns. In: Proceedings of the 12th USE-NIX Security Symposium (Security’03), (2003)

14. Roesch, M.: Snort: lightweight intrusion detection for net-works. In: Proceedings of USENIX LISA ’99, November1999, (software available from http://www.snort.org/)

15. Paxson, V.: Bro: a system for detecting network intruders inreal-time. In: Proceedings of the 7th USENIX Security Sym-posium, (1998)

16. Jordan, C.: Writing detection signatures. USENIXLogin 30(6), 55–61 (2005)

17. K2, ADMmutate, http://www.ktwo.ca/ADMmutate-0.8.4.tar.gz, (2001)

18. Detristan, T., Ulenspiegel, T., Malcom, Y., Underduk,M.: Polymorphic shellcode engine using spectrum analysis.Phrack 11(61), (2003)

19. Rix, Writing IA32 alphanumeric shellcodes. Phrack 11(57),(2001)

20. Bania, P.: TAPiON, http://pb.specialised.info/all/tapion/,(2005)

21. Toth, T., Kruegel, C.: Accurate buffer overflow detection viaabstract payload execution. In: Proceedings of the 5th Sym-posium on Recent Advances in Intrusion Detection (RAID),(2002)

22. Akritidis, P., Markatos, E.P., Polychronakis, M.,Anagnostakis, K.: STRIDE: Polymorphic sled detec-tion through instruction sequence analysis. In: Proceedingsof the 20th IFIP International Information SecurityConference (IFIP/SEC), (2005)

23. Crandall, J.R., Wu, S.F., Chong, F.T.: Experiences usingminos as a tool for capturing and analyzing novel wormsfor unknown vulnerabilities. In: Proceedings of the Confer-ence on Detection of Intrusions and Malware & VulnerabilityAssessment (DIMVA), (2005)

24. Pasupulati, A., Coit, J., Levitt, K., Wu, S., Li, S., Kuo, J.,Fan, K.: Buttercup: on network-based detection of polymor-phic buffer overflow vulnerabilities. In: Proceedings of theNetwork Operations and Management Symposium (NOMS),pp. 235–248, (2004)

25. Pincus, J., Baker, B.: Beyond stack smashing: recentadvances in exploiting buffer overflows. IEEE Security Pri-vacy 2(4), 20–27 (2004)

26. Kreibich, C., Crowcroft, J.: Honeycomb–creating intrusiondetection signatures using honeypots. In: Proceedings of theSecond Workshop on Hot Topics in Networks (HotNets-II),(2003)

27. Kolesnikov, O., Dagon, D., Lee, W.: Advanced polymor-phic worms: evading IDS by blending in with normaltraffic. In: College of Computing, Georgia Instituteof Technology, Atlanta, GA 30332, http://www.cc.ga-tech.edu/ ok/w/ok_pw.pdf, (2004)

28. Newsome, J., Karp, B., Song, D.: Paragraph: thwarting signa-ture learning by training maliciously. In: Proceedings of the9th International Symposium on Recent Advances in Intru-sion Detection (RAID), (2006)

29. Payer, U., Teufl, P., Lamberger, M.: Hybrid enginefor polymorphic shellcode detection. In: Proceedings ofthe conference on detection of intrusions and mal-ware and vulnerability assessment (DIMVA), pp. 19–31,(2005)

30. Linn, C., Debray, S.: Obfuscation of executable code toimprove resistance to static disassembly. In: Proceedings ofthe 10th ACM conference on Computer and communicationssecurity (CCS), pp. 290–299, (2003)

31. Aycock, J., deGraaf, R., Jacobson, M.: Anti-disassembly usingcryptographic hash functions. Department of Computer Sci-ence, University of Calgary, Technical Report, pp. 793–824,(2005)

32. Venable, M., Chouchane, M.R., Karim, M.E., Lakhotia, A.:Analyzing memory accesses in obfuscated x86 executables.In: Proceedings of the conference on detection of intru-sions and malware and vulnerability assessment (DIMVA),(2005)

33. Collberg, C.S., Thomborson, C.: Watermarking, tamper-proffing, and obfuscation: tools for software protection. IEEETrans. Softw. Eng. 28(8), 735–746 (2002)

34. Wang, C., Hill, J., Knight, J., Davidson, J.: Softwaretamper resistance: Obstructing static analysis of pro-grams. University of Virginia, Technical Report CS-2000–12,(2000)

35. Madou, M., Anckaert, B., Moseley, P., Debray, S., Sutter, B.D.,Bosschere, K.D.: Software protection through dynamic codemutation. In: Proceedings of the 6th International Workshopon Information Security Applications (WISA), pp. 194–206,(2005)

36. Schwarz, B., Debray, S., Andrews, G.: Disassembly of exe-cutable code revisited. In: Proceedings of the ninth workingconference on reverse engineering (WCRE), (2002)

37. Prasad, M., cker Chiueh, T.: A binary rewriting defenseagainst stack based overflow attacks. In: Proceedings of theUSENIX annual technical conference, (2003)

38. Kruegel, C., Robertson, W., Valeur, F., Vigna, G.: Sta-tic disassembly of obfuscated binaries. In: Proceed-ings of the USENIX security symposium, pp. 255–270,(2004)

39. Cohen, F.B.: Operating system protection through programevolution. Comput. Sec. 12(6), 565–584 (1993)

40. Metasploit project, http://www.metasploit.com/, (2006)41. Cifuentes, C., Gough, K.J.: Decompilation of binary pro-

grams. Softw. Prac. Exp. 25(7), 811–829 (1995)42. Balakrishnan, G., Reps, T.: Analyzing memory accesses in x86

executables. In: Proceedings of the International Conferenceon Compiler Construction (CC), (2004)

43. Noir, GetPC code (was: Shellcode from ASCII), http://www.securityfocus.com/ archive/82/327100/2006-01-03/1, June 2003

44. Ionescu, C.: GetPC code (was: Shellcode from ASCII), http://www.securityfocus.com/archive/82/327348/2006-01-03/1, July2003

45. Wever, B.-J.: Alpha 2, (2004), http://www.edup.tudelft.nl/ bjw-ever/src/alpha2.c

46. Perriot, F., Ferrie, P., Ször, P.: Striking similarities. Virus Bull.,pp. 4–6, (2002)


47. Obscou, Building IA32 ‘unicode-proof’ shellcodes. Phrack11(61), (2003)

48. Tubella, J., González, A.: Control speculation in multi-threaded processors through dynamic loop detection. In:Proceedings of the 4th International Symposium on High-Performance Computer Architecture (HPCA), (1998)

49. McCanne, S., Leres, C., Jacobson, V.: Libpcap. http://www.tcp-dump.org/, (2006)

50. Wojtczuk, R.: Libnids. http://libnids.sourceforge.net/, (2006)51. jt, Libdasm. http://www.klake.org/∼jt/misc/libdasm-1.4.tar.

gz, (2006)52. Apache Chunked Encoding Overflow. http://www.os-

vdb.org/838, (2002)53. Microsoft Windows RPC DCOM Interface Overflow,

http://www.osvdb.org/2100, (2003)54. Microsoft Windows LSASS Remote Overflow, http://www.

osvdb.org/5248, (2004)55. Bell, J.R.: Threaded code. Comm. of the ACM. 16(6),

370–372 (1973)56. Bellard, F.: QEMU, a fast and portable dynamic translator. In:

Proceedings of the USENIX Annual Technical Conference,FREENIX Track, pp. 41–46, (2005)

57. Bhatkar, S., DuVarney, D.C., Sekar, R.: Address obfuscation:an efficient approach to combat a broad range of memoryerror exploits. In: Proceedings of the 12th USENIX SecuritySymposium, (2003)

58. Anagnostakis, K., Sidiroglou, S., Akritidis, P., Xinidis, K.,Markatos, E., Keromytis, A.D.: Detecting targeted attacksusing shadow honeypots. In: Proceedings of the 14th USE-NIX Security Symposium, pp. 129–144, (2005)

59. Hsu, F.-H., Chiueh, T.-C.: CTCP: a transparent centralizedtcp/ip architecture for network security. In: Proceedings ofthe 20th Annual Computer Security Applications Confer-ence (ACSAC), pp. 335–344, (2004)

60. Liang, Z. Sekar, R.: Fast and automated generation of attacksignatures: a basis for building self-protecting servers. In: Pro-ceedings of the 12th ACM conference on Computer and com-munications security (CCS), pp. 213–222, (2005)

61. Dreger, H., Kreibich, C., Paxson, V., Sommer, R.: Enhanc-ing the accuracy of network-based intrusion detection withhost-based context. In: Proceedings of the Conference onDetection of Intrusions and Malware and VulnerabilityAssessment (DIMVA), (2005)

Network-level polymorphic shellcode detection using emulationAbstract IntroductionRelated workStatic analysis resistant polymorphic shellcodeThwarting disassemblyThwarting control and data flow analysisNetwork-level executionApproachPosition-independent codeGetPC codeKnown operand valuesDetection algorithmRunning the shellcodeOptimizing performanceDetection heuristicEnding executionInfinite loop squashingImplementationExperimental evaluationTuning the detection heuristicEvaluation with real trafficEvaluation with synthetic requestsDefining a stricter detection heuristicValidationPolymorphic shellcode executionDetection effectivenessProcessing costLimitationsNon-self-modifying shellcodeNon-self-contained shellcodeEndless loopsTransformations beyond the transport layerConclusion

Network-level polymorphic shellcode detection using …fabian/course_papers/polymorphic-detect.pdfof...

Documents

Transcript of Network-level polymorphic shellcode detection using …fabian/course_papers/polymorphic-detect.pdfof...