
Transcript of Probabilistic Disassembly - GitHub Pages

Page 1: Probabilistic Disassembly - GitHub Pages

Probabilistic Disassembly

Kenneth Miller∗, Yonghwi Kwon†, Yi Sun∗, Zhuo Zhang∗, Xiangyu Zhang∗, Zhiqiang Lin‡

∗Department of Computer Science, Purdue University, West Lafayette, USA

†Department of Computer Science, University of Virginia, Charlottesville, USA

‡Department of Computer Science and Engineering, Ohio State University, Columbus, USA

Abstract—Disassembling stripped binaries is a prominent challenge for binary analysis, due to the interleaving of code segments and data, and the difficulties of resolving control transfer targets of indirect calls and jumps. As a result, most existing disassemblers have both false positives (FP) and false negatives (FN). We observe that uncertainty is inevitable in disassembly due to the information loss during compilation and code generation. Therefore, we propose to model such uncertainty using probabilities and propose a novel disassembly technique, which computes a probability for each address in the code space, indicating its likelihood of being a true positive instruction. The probability is computed from a set of features that can reach an address, including control flow and data flow features. Our experiments with more than two thousand binaries show that our technique does not have any FN and has only 3.7% FP. In comparison, a state-of-the-art superset disassembly technique has 85% FP. A rewriter built on our disassembly can generate binaries that are only half of the size of those by superset disassembly and run 3% faster. While many widely-used disassemblers such as IDA and BAP suffer from missing function entries, our experiment also shows that even without any function entry information, our disassembler can still achieve 0 FN and 6.8% FP.

I. INTRODUCTION

Analyzing and transforming commercial-off-the-shelf and legacy software have many applications [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], such as bug finding, security hardening, reverse engineering, code clone detection and refactoring. However, they are highly challenging due to the lack of source code. The first fundamental problem is to precisely disassemble the software. The seemingly simple task is indeed highly challenging due to the diversity and complexity of compilation and optimizations. There are two popular kinds of disassembly techniques. The first one disassembles instructions following the address order, called linear sweep disassemblers, and the other disassembles instructions by following control flow edges (e.g., jumps and calls), called traversal disassemblers. Both have well known limitations. In particular, code and data can interleave, causing a large number of false positives and even false negatives in linear sweep disassemblers; traversal disassemblers suffer from indirect control flow caused by function pointers, virtual tables, and switch-case statements, which makes recognizing control transfer targets highly difficult. Even the state-of-the-art disassemblers such as those in BAP [22], IDA-Pro [23], OllyDbg [24], Jakstab [25], SecondWrite [26], and Dyninst [27] have difficulty fully disassembling complex binaries [28]. Some may miss up to 30% of the code [28]. There are machine learning based methods [29] that aim

to recognize function entries by instruction patterns (e.g., starting with "push ebp"). However, such methods have inevitable false positives and false negatives (e.g., the entries of many library functions do not follow specific patterns). Recently, superset disassembly [30] was proposed to address these limitations. It disassembles at each address to produce a superset of instructions. A rewriter is built on the disassembler to instrument all superset instructions. While it has a critical guarantee of no false negatives that other binary rewriting tools cannot provide, the rewritten binaries have substantial code size blow-up and nontrivial runtime overhead (e.g., 763% size overhead and 3% runtime overhead on SPEC programs).

We argue that the capability of reasoning about uncertainty is critical for binary analysis, since uncertainty is inherent due to the lack of symbolic information. Our overarching idea is hence to use probabilities to model uncertainty and then perform probabilistic inference to determine the appropriate way of disassembling subject binaries. In particular, our disassembler computes a posterior probability for each address in the code section to indicate the likelihood of the address denoting a true positive instruction (i.e., an instruction generated by the compiler). Specifically, our technique disassembles the binary at each address just like superset disassembly. We call the result the superset instructions or valid instructions, which may or may not be true positives. We then identify correlations between these superset instructions, such as one being the transfer target of another, and one defining a register that is later accessed by another. These relations denote semantic features that only the real code body would likely demonstrate. We call them hints. They are uncertain because instructions decoded from random bytes may by chance possess such features. For each kind of hint, we perform apriori probability analysis to determine their prior probabilities. We develop an algorithm to aggregate these hints and compute the posterior probabilities. The resulting disassembler has probabilistic guarantees of no false negatives (e.g., the likelihood of missing a true positive instruction is lower than 1/1000). In our empirical study with 2,064 binaries, it never misses any true positive instruction with an appropriate setting. It also has a much smaller number of false positives and much lower overhead in rewriting, compared with superset disassembly.

Our contributions are summarized as follows.

• We propose an innovative idea of probabilistic disassembling. The capability of reasoning about uncertainty provides unique benefits compared to existing techniques.

• We identify a set of features for use as disassembly hints

Page 2: Probabilistic Disassembly - GitHub Pages

Disassembler   | False Negative | False Positive
Linear sweep   | Some           | Substantial
Traversal [23] | Substantial    | None
Superset [30]  | None           | Bloated
Our method     | None*          | Some*

*: with probabilistic guarantees

TABLE I: Comparison of Different Kinds of Disassemblers

and perform static probability analysis to determine their likelihood (§III-B).

• We develop a novel inference algorithm that leverages a number of key characteristics of x86 instruction design (§IV) to aggregate uncertain hints.

• Our experiments on 2,064 binaries demonstrate that our technique does not have any false negatives, and the false positive rate is 3.7%, meaning that it disassembles 3.7% additional instructions that are not true positives. It does not miss any instructions even when function entries are not available, with 6.8% FP. Our evaluation on SPEC Windows PE binaries shows that objdump misses 3095 instructions due to code and data interleavings, whereas our tool misses none with 8.12% FP. We also use our disassembler to support binary rewriting. When compared with the state-of-the-art superset rewriting technique [30], our technique reduces the size of the rewritten binary by about 47% and improves the runtime speed of the rewritten binary by 3%.

II. BACKGROUND AND MOTIVATION

In this section, we use a real world example to explain binary code disassembly, the limitations of existing work (§II-A), and how we advance the state of the art (§II-B).

A. Binary Code Disassembly

Figure 1(a) presents a snippet from libUbuntuComponents.so in Ubuntu 16.04. In this piece of code, data is inserted in between the code bodies of two functions. In (a), the bytes from 0xbbf72 to 0xbbf8f (in blue) denote data. Address 0xbbf90 denotes the entry of a function. Another function (omitted from the figure) precedes the data bytes. While the binary is stripped, we acquire the ground truth through debug symbols from a separate unstripped instance.

Linear Sweep Disassembly. Linear sweep disassemblers disassemble the next instruction from the bytes right after the current instruction. Here, we use objdump. Without symbolic information, objdump cannot recognize the data bytes. As a result, after it disassembles the body of the preceding function, it proceeds to disassemble the data bytes to instructions 0xbbf72, 0xbbf8b, and so on as in Figure 1(b). Specifically, in the shaded area, it considers the three bytes starting at 0xbbf8f an instruction. Consequently, it misses the true function entry 0xbbf90. Note that the instruction sequences in Figure 1 are horizontally aligned by their addresses. In addition, objdump disassembles the wrong instruction at 0xbbf92. This illustrates that linear disassemblers cannot properly handle interleavings of data and instructions. Note that embedding data such as constant

values and jump tables in between code segments is a common practice in compilers [28], [31]. As presented in Table I, linear sweep disassemblers have some false negatives (i.e., missing instructions) and a lot of false positives (i.e., incorrectly disassembling data bytes as instructions). False negatives are particularly problematic for binary rewriting as missing even a single instruction could have catastrophic consequences. False positives can cause unnecessary overhead in rewriting, ambiguity in type reverse engineering and so on.

Traversal based Disassembly. Some other disassemblers such as IDA [23] and BAP [22] disassemble by following control flow edges, starting from function entries. A prominent challenge is to recognize function entries. Missing an entry means the entire function body may not be properly disassembled. The presence of indirect calls makes function entry identification difficult as the precise call targets are only known at runtime. In our example, there is no direct invocation of the function entry 0xbbf90 in libUbuntuComponent and the function is not exported either. As a result, IDA misses the entire function body. Furthermore, the first instruction of the function entry is a rarely used instruction "MOV 0x19b978(rip), rax". As such, ML based techniques (e.g., [32], [29], [33]) likely miss it. There are also non-learning techniques to recognize functions in binaries [34], [35], [36]. They are based on heuristics such as the matching of push and pop operations at the entry and exit of a function. However, a systematic way to handle the inherent uncertainty in such heuristics is still needed.

As illustrated by Table I, traversal disassemblers have no false positives but potentially substantial false negatives. In fact, Bao et al. [29] show that traversal disassemblers such as IDA may miss 68.19% of function entries.

Superset Disassembly. A state-of-the-art technique (particularly for rewriting/instrumentation) is called superset disassembly [30]. The idea is to consider that every address starts an instruction, called a superset instruction. As such, consecutive superset instructions may share common bytes. Rewriting is performed on all superset instructions. It can be easily inferred that the superset disassembler has no false negatives but must have a bloated code body due to the large number of superset instructions that are not true positives (Table I). Figure 1(c) presents the results for superset disassembly. Observe that a superset instruction is generated by disassembling the bytes starting at each address. Hence, we have instructions at 0xbbf72, 0xbbf73, ..., 0xbbf91, 0xbbf92, and so on. Observe that consecutive instructions share common byte values (e.g., the body of 0xbbf91 "8b 05 71 b9 19 00" is the suffix of 0xbbf90). Also observe that all the true positive instructions, i.e., those in Figure 1(a), are part of the superset. As such, the rewritten binary can properly execute as all possible jump/call targets must be instructions in the superset and hence instrumented. Note that the bloated instructions cause not only substantial size overhead, but also runtime slowdown because executing each


Page 3: Probabilistic Disassembly - GitHub Pages

[Figure 1: four horizontally aligned disassembly listings of the bytes from 0xbbf72 to 0xbbfd7 — (a) the ground truth, (b) linear sweep (objdump), (c) superset disassembly, and (d) our technique, which annotates each superset instruction with its posterior probability; hints 1-4 are marked on the listings.]

Fig. 1: Example from libUbuntuComponent.so. Instructions are horizontally aligned by their addresses. The code is slightly modified for demonstration purposes. In instructions with two operands, the first one is source and the second one is destination.

superset instruction requires a table lookup to determine the location of the instrumented version.
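To make the superset idea concrete, the following sketch (ours, not the paper's implementation, which is built on BAP) decodes one instruction at every byte offset of a code buffer with the third-party Capstone library, so consecutive superset instructions may share bytes exactly as described above.

# A minimal sketch of superset disassembly, assuming the Capstone
# Python bindings; offsets that do not decode simply yield no instruction.
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

def superset_disassemble(code: bytes, base: int):
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    superset = {}
    for off in range(len(code)):
        # x86 instructions are at most 15 bytes long.
        insn = next(md.disasm(code[off:off + 15], base + off, count=1), None)
        if insn is not None:
            superset[base + off] = insn
    return superset

if __name__ == "__main__":
    # Bytes at 0xbbf90 in Figure 1: a 7-byte MOV followed by PUSH r14.
    code = bytes.fromhex("488b0571b919004156")
    for addr, i in superset_disassemble(code, 0xbbf90).items():
        print(hex(addr), i.mnemonic, i.op_str)

The table-lookup overhead discussed above comes from keeping an instrumented copy (and a table entry) for every such address; pruning the superset shrinks both.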

B. Our Technique

We aim to inherit the advantages of superset disassembly (i.e., no false negatives) while substantially reducing the false positives and achieving much lower overhead. The idea is that true positives have lots of hints indicating that they are true instructions. For example, they often have a lot of definition and use (def-use) relations caused by registers and memory, that is, a register/memory-location is defined at an earlier instruction and then used in a later one. In Figure 1(a), hint 1 indicates a def-use relation caused by register rax between instructions 0xbbf90 and 0xbbfa2; hint 2, by rdx; hint 3 indicates a def-use by the flag bit. Note that false positive

instructions are less likely to induce def-use relations due to their random nature. For example, the instructions at 0xbbf8b-0xbbf8f (Figure 1(c)) define some memory indexed by rax, but there are no corresponding uses. Furthermore, two jumps to the same target are likely true positives (e.g., hint 4) as the chance that random jumps have the same target is small. More hints are discussed in §III-B.

However, hints are uncertain, meaning that false positive instructions have a (small) chance of exhibiting such features. For example, according to §III-B, false positive instructions may have a 1/16 chance of having a def-use relation caused by some register. Hence, the essence of our technique is to associate these hints with prior probabilities that are derived from apriori probability analysis, and then perform probabilistic inference to fuse these evidences to form strong confidence about true positives. Intuitively, the inference procedure that aggregates prior probabilities is based on the following reasoning: if a superset instruction is likely to be a true positive, its control flow descendants are likely to be true positives, and the different superset instructions that share common bytes with it are unlikely to be true positives. Note that we aim to disassemble binaries generated by regular compilers so that instructions do not have overlapping bodies. For example, the instructions involved in hints 1-4 have reachability along control flow (e.g., those in hint 1 can reach hint 4), allowing their probabilities to be propagated and aggregated. Intuitively, while hints 1-4 individually have a certain probability (e.g., 1/16) of being random, the chance of all of them randomly happening together is very low. After inference, the posterior probabilities indicate the likelihood of superset instructions being true positives. Figure 1(d) shows the probabilities computed by our technique for each superset instruction. Observe that the true positives (highlighted ones) have large probabilities (some of them are almost certain, such as 0xbbfb0 and 0xbbfba), whereas false positives have (very) small probabilities.
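As a back-of-the-envelope illustration of this aggregation (the actual inference algorithm is given in §IV), multiplying the "this is random data" priors of several independent hints drives the joint probability toward zero very quickly. The priors below reuse the 1/16 register def-use figure derived in §III-B purely as an example.

# Toy illustration: four independent hints, each leaving a 1/16 chance
# that the bytes are random data, jointly leave only ~1.5e-05.
from functools import reduce
from operator import mul

hint_priors = [1 / 16, 1 / 16, 1 / 16, 1 / 16]
p_random_data = reduce(mul, hint_priors, 1.0)
print(f"P(random data | 4 such hints) = {p_random_data:.1e}")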

Fig. 2: Occlusion does not cascade

III. PROBABILISTIC CHARACTERISTICS OF X86

A. Observing Instruction Occlusion

In x86, part of a valid instruction may be another valid instruction and two valid instructions may have overlapping


Page 4: Probabilistic Disassembly - GitHub Pages

bodies. We call them occluded instructions. We say a few bytes form a valid instruction if they can be decoded to an instruction. A valid instruction may not be a true positive instruction. Therefore, if the starting point (e.g., function entry) is not properly recognized, we may have an occluded instruction sequence that differs from the true positive sequence.

Consider an example in Figure 2. Column one shows the continuous addresses; column two shows the byte values; and the remaining columns show different instruction sequences when disassembling starts at different addresses. Note that each instruction (box) aligns horizontally with its addresses and byte values in the first two columns. Column three shows the ground truth instruction sequence, in which the first four bytes (from 0x400597 to 0x40059a) form a MOV instruction whereas the following five bytes form another MOV instruction, followed by a CALL instruction. However, if we start disassembling in the middle of the first instruction, we could acquire sequences of valid instructions that occlude with the ground truth, as shown in the remaining columns (i.e., occluded instructions are in grey). Observe that in columns four and five, part of the MOV instruction is decoded to a different MOV instruction and a conditional jump instruction, respectively. In the last column, the last byte 0xe0 of the MOV instruction even groups with the first byte 0xbf of the next (ground truth) instruction to form a valid LOOPNE instruction.
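This convergence behavior is easy to observe with any decoder. The sketch below is ours; it reuses the MOV/PUSH byte values from Figure 1 (Figure 2's raw bytes are not reproduced in this transcript) and the third-party Capstone library, and it disassembles the same buffer starting at three adjacent offsets. All three sequences re-align on the PUSH r14 at the ninth byte.

# Observe occluded sequences converging on a common suffix.
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

md = Cs(CS_ARCH_X86, CS_MODE_64)
code = bytes.fromhex("488b0571b919004156")   # 7-byte MOV, then PUSH r14

for start in (0, 1, 2):
    seq, off = [], start
    while off < len(code):
        insn = next(md.disasm(code[off:off + 15], 0xbbf90 + off, count=1), None)
        if insn is None:
            break
        seq.append(f"{insn.address:#x} {insn.mnemonic}")
        off += insn.size
    print(" -> ".join(seq))
# Every printed sequence ends with "0xbbf97 push", i.e., they converge.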

A concern about occlusion is that it may be cascading, meaning that when we start at a wrong place, a large number of following instructions are consequently occluded. However, researchers have made the following observation [37].

(Occlusion Rule): Cascading occlusion is highly unlikely: occluded sequences tend to quickly agree on a common suffix of instructions.

If one of the sequences is the true positive sequence, the occluded sequences quickly converge with the true positive one. Consider the example in Figure 2. The three occluded sequences all converge to the ground truth sequence after one or two instructions. Intuitively, cascading occlusion is unlikely because two occluded instructions have a good chance of agreeing on their rears. In other words, the suffix of an instruction is likely to be another instruction. Consider Figure 2. The occluded instructions in columns 3 and 4 are suffixes of the ground truth MOV instruction. The only exception is when an occluded instruction i0 (e.g., the LOOPNE instruction in the last column of Figure 2) starts at the very end of a valid instruction j0 (e.g., the first MOV in the 3rd column); i0 may go beyond j0 and cause occlusion in the instruction following j0, say j1 (e.g., the second MOV in the 3rd column). In this case, i0 likely ends in the middle of j1. As such, the instruction(s) following i0 (e.g., the SUB and ADD instructions in the last column) agree with j1 at their rear ends. We did a study on 2,064 ELF binaries and found that 99.992% of occluded instruction sequences converge within four instructions. We have also conducted a formal probability proof based on the encodings of x86 instructions. Our proof shows that, for instructions i0, ..., ik with n0, ..., nk bytes respectively, the probability of an occluded sequence starting inside i0 and not agreeing with the rear of ik is at most 1/((n0−2)...(nk−2)). With a sequence of 7 instructions, each having 5 bytes, the probability that an occluded sequence does not converge at all is 1/3^7 = 1/2187.

Fig. 3: Control flow convergence

Intuitively, this is analogous to two parties that fail to settle a dispute with a small probability p in one round of negotiation: the probability that they cannot resolve it within n rounds is p^n. The details are elided.
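A quick arithmetic check of the bound, as reconstructed above (the byte lengths are the example's; the helper name is ours):

# Upper bound on the probability that an occluded sequence never agrees
# with the rear of the last instruction, for byte lengths n_0 .. n_k.
def non_convergence_bound(lengths):
    bound = 1.0
    for n in lengths:
        bound *= 1.0 / (n - 2)
    return bound

print(non_convergence_bound([5] * 7))   # 1/3**7, roughly 4.6e-4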

B. Observing Probabilistic Hints for Disassembling

Without knowing the appropriate entries of code segments, we could disassemble at each address and acquire a set of all valid instructions (or, superset instructions [30]) with only some being true positives. Next we discuss a number of correlations between valid instructions that indicate that the corresponding bytes are not data bytes with high probabilities. We call them probabilistic hints. The occlusion rule and the probabilistic hints are the two cornerstones of our technique.

Hint I: Control Flow Convergence. As shown in the middle of Figure 3(b), if there are three potential instructions instr1, instr2 and instr3, with instr3 being the transfer target of both instr1 and instr2, there is a good chance that they are not data bytes (but rather instruction bytes). Figure 3(a) shows an example. The bytes starting at 0x804a634 and at 0x804a646 are disassembled to two conditional jumps A and B, respectively, whose target is the same valid instruction C. Intuitively, since it is highly unlikely that data bytes can form two control transfer instructions that both by chance point to the same target, they are likely instruction bytes. This control flow relation is often induced by high level language structures such as conditional statements (e.g., Figure 3(c)).

Probability Analysis. Assume data byte values have a uniform distribution. Given two valid control transfer instructions instr1 and instr2, let instr1's transfer target be t, which has the range of [−2^7+1, 2^7−1], [−2^15+1, 2^15−1], and [−2^31+1, 2^31−1] for relative, near, and long jumps, respectively. The likelihood that instr2 has the same transfer target is hence 1/255, 1/(2^16−1), and 1/(2^32−1), respectively. In other words, when we see two control transfer instructions having the same target, the likelihood that they are data bytes is (very) low.
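Hint I can be collected purely syntactically over the superset instructions. The sketch below is our illustration (the shared target address in the example is made up; the text only states that the two jumps at 0x804a634 and 0x804a646 share a target): it groups direct transfer targets and flags every transfer whose target coincides with another's, attaching the 1/255 prior of an 8-bit relative jump.

# Collect Hint I (control flow convergence) from direct branch targets.
from collections import defaultdict

def convergence_hints(direct_targets, prior=1 / 255):
    """direct_targets: transfer-instruction address -> target address.
    Returns {address: prior probability of being data} for transfers
    that share their target with at least one other transfer."""
    by_target = defaultdict(list)
    for addr, tgt in direct_targets.items():
        by_target[tgt].append(addr)
    hints = {}
    for addrs in by_target.values():
        if len(addrs) >= 2:
            for a in addrs:
                hints[a] = prior
    return hints

# Two conditional jumps sharing a (hypothetical) target, as in Figure 3(a).
print(convergence_hints({0x804a634: 0x804a650, 0x804a646: 0x804a650}))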

Hint II: Control Flow Crossing. As shown in the middle of Figure 4(b), if there are three valid instructions instr1, instr2 and instr3, with instr2 and instr3 next to each other, instr3 being the transfer target of instr1, and instr2 having


Page 5: Probabilistic Disassembly - GitHub Pages


Fig. 4: Control Flow Crossing

a control transfer target different from instr3 (and hence crossing control flow edges), there is a good chance that they are not data bytes (but rather instruction bytes). Figure 4(a) shows an example. Since it is highly unlikely that data bytes can form two control transfer instructions with one jumping to right after the other, they are likely instructions. This control flow relation is often induced by loopy language structures (e.g., Figure 4(c), with instr1 the loop head, instr2 the last instruction of the loop body, and instr3 the loop exit). The probability analysis is similar to that of control flow convergence and hence elided.

There are also other control flow related hints. For example, if a valid control transfer instruction i (e.g., a jump) has a target that does not occlude with the sequence starting from i, the chance of i denoting data bytes is 1/n, with n the average instruction length. This is because a false positive jump (disassembled from random data bytes) is likely to jump to the middle of an instruction. Although this hint is not as strong as the convergence and crossing hints, a large number of such hints can be aggregated to form a strong indication, through an algorithm described in §IV.

Hint III: Register Define-use Relation. We say a pair of instructions instr1 and instr2 have a register define-use (def-use) relation if instr1 defines the value of a register (or some flag bit) and instr2 uses the register (or the flag bit). In Figure 5(c), there are two def-use relations denoted by the arrows, one induced by register rdx and the other by eax. Another example is that a flag bit is set by a comparison instruction and then used by a following conditional jump instruction. Given two valid instructions, if they have a def-use relation, they are unlikely to be data bytes.

Note that false positive instructions often do not have register def-use relations, although they may demonstrate (bogus) memory def-use relations. Figure 5(a) presents a snippet of a jump table disassembled to a sequence of instructions. Observe that the first instruction adds al to the memory location indicated by rax whereas the second instruction adds cl to the same location. There is a memory def-use between the two instructions as the second instruction first reads the value stored in the location and then performs the addition. However, as we will show in the later probability analysis, register def-use is hardly random, but rather caused by register allocation (by the compiler). Figure 5(b) presents a snippet of a string. It is disassembled to a sequence of valid instructions too. Observe that there are no register def-use relations.

Probability Analysis. Assume data byte values have a uniform distribution. To simplify our discussion, we further assume an arbitrary valid instruction has a 1/2 chance of writing to some register or some flag bit (and the other 1/2 chance of writing only to memory). In contrast, an arbitrary valid instruction reading some register is much more likely. Note that even a read from memory often entails reading from a register. For example, the instruction at 0x4005ce in Figure 5(c) performs a memory read which entails reading rbp. Hence, we make an approximation (just for the sake of demonstrating our probability analysis), assuming the likelihood that an instruction reads some register is 0.99. Each instruction has three bits to indicate which register is being read/written-to according to the x86 instruction reference. As such, given two valid instructions instr1 and instr2, they have a register def-use relation with a chance of 1/2 × 1/2^3 = 1/16. In other words, when we observe a def-use relation between two valid instructions, the chance that they denote data bytes is 1/16.

We need to point out that these hints only indicate that the corresponding bytes are not data bytes; they do not suggest that the valid instructions are indeed true positives. In other words, they may be occluded instructions that are part of some ground truth instructions. This is because occluded instructions often share similar features such as the same register operand(s). For instance, the bytes "89 c2", which are the suffix of the first instruction in Figure 5(c), are disassembled to MOV eax, edx, which also has a register def-use relation with the second instruction. However, observing these hints strongly suggests that the corresponding bytes are instruction bytes. Fortunately, the aforementioned occlusion rule dictates that even if there is occlusion, it will soon be automatically corrected. Our disassembly technique is hence built on this observation.

Besides the register def-use hint, we have other hints that denote data flow related program semantics. For example, an instruction saving a register to a memory location followed by another instruction that defines the register corresponds to register spilling [38], which can hardly be random. We also consider memory def-use between instructions of different opcodes. Details are elided.
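A simplified detector for the register def-use hint is sketched below; it is our illustration, not the paper's code. It assumes a front end has already produced, for each superset instruction, its byte length and the sets of registers it defines and uses, and it only scans a short window of following addresses rather than following control flow as the paper does.

# Toy detection of Hint III (register def-use) over superset instructions.
# insns: address -> (size_in_bytes, set_of_defined_regs, set_of_used_regs).
def defuse_hints(insns, window=16, prior=1 / 16):
    hints = {}
    for addr, (size, defs, _uses) in insns.items():
        if not defs:
            continue
        for nxt in range(addr + size, addr + size + window):
            entry = insns.get(nxt)
            if entry is not None and defs & entry[2]:   # later use of a defined reg
                hints[addr] = prior
                break
    return hints

# Mirrors hints 1 and 2 of Figure 1(a): rax defined at 0xbbf90 and used
# at 0xbbfa2; rdx defined at 0xbbfa2 and used at 0xbbfb0.
example = {
    0xbbf90: (7, {"rax"}, set()),
    0xbbfa2: (4, {"rdx"}, {"rax"}),
    0xbbfb0: (3, set(), {"rdx", "rdi"}),
}
print(defuse_hints(example))   # {0xbbf90: 0.0625, 0xbbfa2: 0.0625}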

IV. PROBABILISTIC DISASSEMBLING ALGORITHM

As discussed in the previous section, when a probabilistic hint is observed, we have certain confidence that the corresponding bytes are not data bytes but rather instruction bytes, although we are still uncertain whether they are true positive instructions, as their occluded peers may have similar properties as well. The occlusion rule dictates that a sequence that starts with some occluded instruction can quickly correct itself and converge on true positive instructions. Therefore, in our method, we consider an instruction likely a true positive if multiple sequences with a large number of hints converge on the instruction. Here, a sequence starting from an instruction i is acquired by following the control flow (e.g., if i is an unconditional jump, the next instruction in the sequence would be the target of the jump). We say multiple sequences converge on an instruction if it occurs in all of them.

Specifically, let a hint h have a prior probability p of being a data byte, with p computed by the analysis in the previous section.


Page 6: Probabilistic Disassembly - GitHub Pages

Fig. 5: Register Definition-use Relation

Fig. 6: Example for the algorithm; the code snippet in foo() corresponds to a statement “for (i=0;i<11;i++) ...”

Since the following instructions are acquired strictly following the control flow semantics, they inherit the probability p. Intuitively, if j is the next instruction of h along control flow, j's probability of being some data byte is equal to or smaller than p. When the sequences starting with multiple hints h1, h2, ..., hn converge on an instruction i, the probability of i representing a data byte is D[i] = p1 × p2 × ... × pn. As such, when a large number of hints converge on i, i is highly unlikely to be a data byte.

However, a small D[i] does not necessarily denote that i is a true positive instruction. We then leverage the exclusion property of a true positive instruction, that is, if i is a true positive instruction, all the other valid instructions occluding with i must not be true positive instructions1. Therefore, we compute the likelihood of i being a true positive instruction by conducting normalization with all the instructions occluded with i. Intuitively, if i is the only one that has a very small D[i] compared to all the occluded instructions, i is highly likely true positive. If there are occluded instructions whose D values are comparable to D[i], we cannot be certain that i is true positive. In this case, we keep all these instructions just like superset disassembly. However, the key point is that, due to the occlusion rule, sequences quickly converge on true positives such that the occluded peers of the converged true positives are not reachable by any sequences and hence receive no hints. As such, the true positives stand out in most cases, the exception being very short and featureless code segments. According to our experiments (see §V), our technique never misses any true

1This property may not hold in manually crafted binaries in which the developer purposely introduces occlusion between true positive instructions. However, we focus on binaries generated by compilers in this paper.

positive and has as low as 3.7% false positives. In comparison, the false positive rate of superset disassembly is 85%.

Algorithm Details. Algorithm 1 takes as input a binary B, which is an array of bytes indexed by address, and a list of hints H, with H[i] = p meaning that i is a hint with a prior probability p (of being data bytes). It produces posterior probabilities P, with P[i] the likelihood that i is a true positive instruction. Within the algorithm, we use D[i] to denote the probability of i being a data byte and RH[i] to denote the set of hints that reach i, each hint represented by its address.

In lines 1-6, the algorithm initializes all the D values and all the RH values. If the bytes starting at i denote an invalid instruction, D[i] is set to 1.0; otherwise it is set to ⊥ to denote that we do not have any knowledge. Note that some byte sequences cannot be disassembled to any valid instruction.

Due to the loopy structures in binaries, the algorithm is overall iterative, and terminates when a fixed point is reached. The iterative analysis is in lines 8-30, with variable fixed_point used to determine termination. The analysis consists of three steps: forward propagation of hints (lines 10-21), local propagation within the occlusion space (lines 22-24), and backward propagation of invalidity (lines 25-30). The first step traverses from the beginning of B to the end, propagating/collecting hints and computing the aggregated probabilities. It leverages the following forward inference: (1) the control flow successor of a (likely) instruction is also a (likely) instruction. Otherwise, the program is invalid because its execution would lead to an exception (caused by the invalid instruction) following the control flow. The second step is to propagate the computed probability for each instruction i to its occlusion space, consisting of all the other addresses that can be decoded into


Page 7: Probabilistic Disassembly - GitHub Pages

Algorithm 1 Probabilistic Disassembling

Input:    B - binary indexed by address
          H - probabilistic hints, denoted by a mapping from an address to a prior probability
Output:   P[i] - posterior probability of an address i denoting a true positive instruction
Variable: D[i] - probability of address i being a data byte
          RH[i] - the set of hints, denoted by a set of addresses, that reach an address i

 1: for each address i in B do
 2:     if invalidInstr(i) then
 3:         D[i] ← 1.0
 4:     else
 5:         D[i] ← ⊥
 6:     RH[i] ← {}
 7: fixed_point ← false
 8: while !fixed_point do
 9:     fixed_point ← true
        ▷ Forward propagation of hints (Step I)
10:     for each address i from start of B to end do
11:         if D[i] ≡ 1.0 then
12:             continue
13:         if H[i] ≠ ⊥ and i ∉ RH[i] then
14:             RH[i] ← RH[i] ∪ {i}
15:             D[i] ← Π_{h ∈ RH[i]} H[h]
16:         for each n, the next instruction of i along control flow, do
17:             if RH[i] − RH[n] ≠ {} then
18:                 RH[n] ← RH[n] ∪ RH[i]
19:                 D[n] ← Π_{h ∈ RH[n]} H[h]
20:                 if n < i then
21:                     fixed_point ← false
        ▷ Propagation to occlusion space (Step II)
22:     for each address i from start of B to end do
23:         if D[i] ≡ ⊥ and ∃j occluding with i, s.t. D[j] ≠ ⊥ then
24:             D[i] ← 1 − min_{j occludes with i}(D[j])
        ▷ Backward propagation of invalidity (Step III)
25:     for each address i from end of B to start do
26:         for each p, the preceding instruction of i along control flow, do
27:             if D[p] ≡ ⊥ or D[p] < D[i] then
28:                 D[p] ← D[i]
29:                 if p > i then
30:                     fixed_point ← false
    ▷ Compute posterior probabilities by normalization
31: for each address i from start of B to end do
32:     if D[i] ≡ 1.0 then
33:         P[i] ← 0
34:         continue
35:     s ← 1/D[i]
36:     for each address j, representing an instruction occluded with i, do
37:         s ← s + 1/D[j]
38:     P[i] ← (1/D[i]) / s

instructions occluding with i. It leverages the following local inference: (2) an instruction being likely renders all the other instructions in its occlusion space unlikely. The third step traverses each address from the end to the beginning and propagates the invalidity of instructions. It leverages the following backward inference: (3) when an instruction i is unlikely, all the instructions that reach i through control flow are unlikely. Intuitively, it is the logical contrapositive of the forward inference rule (1). The first step can be considered to identify instruction bytes, whereas the second and third steps are to identify data bytes.

Step I. In lines 13-15, if i denotes a hint and i has not been added to RH[i], it is added to RH[i] and D[i] is updated to the product of the prior probabilities of all the hints in RH[i] (line 15). In lines 16-21, the algorithm propagates the hints in RH[i] to i's control flow successor(s). Particularly, if RH[i] has some hint that the successor n does not have (line 17), the hints of i are propagated to RH[n] by a union operation (line 18), and D[n] is updated. In lines 20-21, if the successor n has a smaller address so that it has been traversed in the current round, the analysis needs another round to further propagate the newly identified hint(s).

Step II. In lines 22-24, the algorithm traverses all the addresses and performs local propagation of probabilities within the occlusion space of individual instructions. Particularly, for each address i, it finds its occluded peer j that has the minimal probability (i.e., the most likely instruction). The likelihood of i being data is hence computed as 1 − D[j] (line 24).

Step III. Lines 25-30 traverse from the end to the beginning. For each address i, if its control flow predecessor p does not have any computed probability or has a smaller probability (line 27), which intuitively means that we have more evidence that i is data (instead of an instruction), then we set p to have the same level of confidence of denoting data bytes (line 28). In the extreme case, if D[i] ≡ 1.0, D[p] must be 1.0 too. If p has a larger address than i, and hence p must have been traversed, variable fixed_point is reset and the analysis will be conducted for another round (lines 29-30).

Note that the control flow successors and predecessors are implicitly computed along the analysis. Our analysis does not require correctly recognizing indirect jump and call targets, which is a very difficult challenge. In other words, even though such control flow relations are missing, our technique can still collect enough hints from (disconnected) code blocks to disassemble correctly. In §V-D, we show that our technique can disassemble without any function entry information with 0 false negatives and only 6.8% false positives.

After the iterative process, lines 31-38 compute the posterior probabilities for true positive instructions by normalization. If an instruction starting at i is invalid, P[i] is set to 0 (lines 32-33). Otherwise, it sums up the inverse of probability D for all the instructions occluded with i, including i itself, to s; then P[i] is computed as the ratio between 1/D[i] and s.
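The sketch below mirrors Algorithm 1 in Python over a toy model (our own data structures; the paper's implementation is built on BAP in OCaml). The caller supplies, per address, whether the bytes decode to a valid instruction, the control-flow successors, the occluding addresses, and the hint priors H; addresses whose D value is never resolved are conservatively given posterior 0 here, which is a simplification of the paper's handling of ⊥.

# Sketch of Algorithm 1 over a toy model.
# B:     sorted list of addresses.
# valid: addr -> True if the bytes at addr decode to a valid instruction.
# succ:  addr -> list of control-flow successor addresses.
# occl:  addr -> list of addresses whose instructions occlude with addr.
# H:     addr -> prior probability (of being data) for hint addresses.
BOTTOM = None   # the ⊥ ("no knowledge") value

def product(xs):
    r = 1.0
    for x in xs:
        r *= x
    return r

def probabilistic_disassembly(B, valid, succ, occl, H):
    D = {i: (BOTTOM if valid[i] else 1.0) for i in B}
    RH = {i: set() for i in B}
    pred = {i: [] for i in B}
    for p in B:                          # control-flow predecessors for Step III
        for n in succ[p]:
            pred[n].append(p)
    fixed_point = False
    while not fixed_point:
        fixed_point = True
        # Step I: forward propagation of hints.
        for i in B:
            if D[i] == 1.0:
                continue
            if i in H and i not in RH[i]:
                RH[i].add(i)
                D[i] = product(H[h] for h in RH[i])
            for n in succ[i]:
                if RH[i] - RH[n]:
                    RH[n] |= RH[i]
                    D[n] = product(H[h] for h in RH[n])
                    if n < i:
                        fixed_point = False
        # Step II: local propagation to the occlusion space.
        for i in B:
            peers = [D[j] for j in occl[i] if D[j] is not BOTTOM]
            if D[i] is BOTTOM and peers:
                D[i] = 1.0 - min(peers)
        # Step III: backward propagation of invalidity.
        for i in reversed(B):
            if D[i] is BOTTOM:           # nothing to propagate backward
                continue
            for p in pred[i]:
                if D[p] is BOTTOM or D[p] < D[i]:
                    D[p] = D[i]
                    if p > i:
                        fixed_point = False
    # Normalization into posterior probabilities.
    P = {}
    for i in B:
        if D[i] is BOTTOM or D[i] == 1.0:        # never reached, or invalid
            P[i] = 0.0
            continue
        s = 1.0 / D[i] + sum(1.0 / D[j] for j in occl[i]
                             if D[j] is not BOTTOM and D[j] != 1.0)
        P[i] = (1.0 / D[i]) / s
    return P

Successor, predecessor, and occlusion maps are passed in explicitly, so the inference itself is independent of whichever decoder builds the superset.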

Example. Consider an example in Figure 6. It is much simpler than the one in §II and allows easy explanation. The large box on the left shows a code snippet denoting the beginning of a function foo() (from 0x40058c to 0x400594) and part of the function body (from 0x4005c0 to 0x4005e1) corresponding to a simple loop "for (i=0;i<11;i++)...". The code snippet is preceded by data bytes that stand for constant strings (from 0x40057b to 0x40058b). The strings are disassembled to valid instructions. Note that symbolic information is not available; we mark the function entry and strings just for explanation purposes. Boxes A-F on the right stand for sequences starting from some occluded instructions. The instructions in the grey background denote occlusions whereas instructions without background denote the converged ones, which are horizontally aligned with the


Page 8: Probabilistic Disassembly - GitHub Pages

corresponding instructions in the leftmost box. For example, in box A, disassembling at 0x40057c causes occlusion up to 0x400583. In the following, we show how our algorithm computes the probabilities for true positives.

During preprocessing, our technique collects the hints and their prior probabilities. Each circled number denotes such a hint (only part of the hints are shown). For example, hint 1 is a register-def-use hint (hint III in §III-B) due to rdi. According to §III-B, the prior probability is 1/16 (of being a data byte). Note that this hint actually occurs in the data bytes. In addition, hints 2 and 3 stand for the register-spilling (i.e., backup and then update) hint due to rbp and rsp, respectively; hint 4 stands for register-def-use; hint 5 stands for control-flow-crossing (hint II in §III-B); and hint 6 stands for memory-def-use. None of the occluded sequences provide any additional hints.

Initially, D[0x400583] = D[0x40057e] = 1.0 and all other D values are ⊥. In step I, hints are collected and probabilities are computed in a forward fashion. Hint 1 cannot be propagated to address 0x400584 due to the bad instruction at 0x400583, and the sequences in A and D do not provide any hint, hence D[0x400584] = ⊥. Its occluded peers in 0x400585-0x400588 have the same D value.

In contrast, D[0x40058c] = 1/16 due to hint 2. Similarly, D[0x40058d] = (1/16)^3 due to the three hints it is involved in. As shown in box B, its occluded peer 0x40058e cannot be reached from 0x40058c. As a result, it gets no hint and D[0x40058e] = ⊥. Similarly D[0x40058f] = ⊥. Let us skip a few instructions and consider 0x4005db. Due to the loop (with the backedge 0x4005df→0x4005c2), hints 2-6 all reach 0x4005db. As such, D[0x4005db] is a tiny value smaller than 1/2^32. In contrast, as shown in C and F, no hints can reach its occluded peers 0x4005dc and 0x4005dd, and their D values remain ⊥. Through step II of local propagation in the occlusion space, D[0x40058f] = D[0x40058e] = 1 − 1/16^3 and D[0x4005dc] = D[0x4005dd] ≈ 1.

In step III, the invalidity information is propagated backward. That is, if an address has a larger D value than its predecessor, the predecessor inherits that D value. Specifically, 0x400583 being invalid invalidates all its control flow predecessors, including 0x400582, 0x400581, 0x40057f, and 0x40057b. That is, their D values equal 1.0.

In contrast, 0x40058d has two possible predecessors, "0x40058b: 4c 55 rex.WR PUSH rbp" (not shown in the code snippet) and "0x40058c: 55 PUSH rbp" (shown in the code snippet). The former has the prefix "rex", which is only used in the long mode [39], and hence does not form any hint with other instructions. Furthermore, it occludes with 0x40058c. As a result, D[0x40058b] = 1 − D[0x40058c] = 15/16 after steps I and II. However, since D[0x40058d] = 1/16^3, which is smaller than D[0x40058b], there is no backward propagation. Although D[0x40058e] = 1 − 1/16^3 is a large value, it does not have any control flow predecessor, that is, it cannot be reached by disassembling at any preceding address.

After the iterative process, the D values are normalized to compute the posterior probabilities. For example, since 0x40058c only occludes with 0x40058b, and D[0x40058b] = 15/16 and D[0x40058c] = 1/16, P[0x40058c] = 16/(16 + 16/15) = 0.94 and P[0x40058b] = 0.058. The other true positive instructions have probabilities higher than 0.99. For instance, P[0x40058d] ≈ 0.9987 and P[0x40058e] = P[0x40058f] ≈ 0.0006. P[0x4005db] ≈ 1.0, and P[0x4005dc] and P[0x4005dd] are negligible.
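The normalization for 0x40058c can be checked directly; only that value is reproduced here, because P[0x40058b] is normalized over 0x40058b's own occlusion set, which is not fully listed in the text.

# Posterior for 0x40058c: D = 1/16 for it, D = 15/16 for its occluded peer.
d_i, d_peer = 1 / 16, 15 / 16
s = 1 / d_i + 1 / d_peer            # 16 + 16/15
print(round((1 / d_i) / s, 2))      # 0.94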

V. IMPLEMENTATION AND EVALUATION

We have implemented a prototype on top of BAP [22] using OCaml. Our implementation has 5,546 LOC. To evaluate our technique, we use two sets of benchmarks. The first set contains 2,064 x86 ELF binaries collected from the BAP corpora [22]. The size of these binaries ranges from 100KB to 3MB. They come with symbolic information, from which we derive the ground truth. We stripped the binaries before applying our disassembler. The second set is the SPEC2006 INT programs. We used SPEC for the comparison with superset disassembly [30]. All the experiments were run on a machine with an Intel i7 CPU and 16 GB RAM. Our evaluation addresses the following research questions (RQ).

• RQ1: Can our technique disassemble binaries with accuracy, completeness, and efficiency (§V-A)?

• RQ2: How does our technique compare with the state-of-the-art superset disassembly (§V-B)?

• RQ3: How does our technique perform when data and code are interleaved, in comparison with linear sweep disassembly (§V-C)?

• RQ4: How does our technique perform when no function entry information is available (e.g., for indirect functions that are one of the most difficult challenges for traversal disassemblers in IDA [23] and BAP [22]) (§V-D)?

A. RQ1: Effectiveness and Efficiency

To answer RQ1, we perform four experiments: (1) measure false negatives (missing true positive instructions) and false positives (bogus instructions) on the 2,064 binaries; (2) measure the disassembling time; (3) analyze the contributions of each individual kind of hint; and (4) study the effect of different probability threshold settings.

FP and FN. We report the results with the probability threshold of P >= 0.01, meaning that we are very conservative and hence keep all the valid instructions with more than 0.01 computed posterior probability. In this setting, our technique does not have any false negatives. Figure 7 shows the correlation between binary size and the FP rate. Observe that most cases cluster at the bottom-left. Most medium to large binaries have lower than 5% false positives. The few largest (on the right) are even lower than 2%. The ones with larger FP rates tend to be small binaries, which have fewer hints. The average FP rate is only 3.7%. This strongly suggests the effectiveness of our technique.
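For completeness, the measurement itself is simple once ground-truth instruction addresses are available; the sketch below is our formulation (the paper does not spell out the exact FP denominator, so computing the rate relative to the kept set is an assumption).

# Sketch of the FP/FN measurement with a posterior threshold.
def evaluate(P, ground_truth, threshold=0.01):
    """P: addr -> posterior probability; ground_truth: set of true
    instruction addresses."""
    kept = {a for a, p in P.items() if p >= threshold}
    fn = ground_truth - kept            # missed true positive instructions
    fp = kept - ground_truth            # bogus instructions that were kept
    fn_rate = len(fn) / len(ground_truth)
    fp_rate = len(fp) / len(kept) if kept else 0.0
    return fn_rate, fp_rate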

Disassembling Time. Figure 8 shows the distribution of time. Observe that it has a close-to-linear relation with the binary size. The largest ones take about 10 minutes to disassemble.


Page 9: Probabilistic Disassembly - GitHub Pages

[Figure 7 scatter plot: False Positive (%) on the y axis versus Binary Size (bytes) on the x axis.]

Fig. 7: Binary Size and FP rate

[Figure 8 scatter plot: Time (s) on the y axis versus Binary Size (bytes) on the x axis.]

Fig. 8: Size and Processing Time

The medium ones take 4-8 minutes. Our algorithm is not as fast as other disassemblers because it is an iterative algorithm based on probabilistic inference. Also, we have not optimized the implementation. We argue that since disassembling is a one-time effort, the cost is justifiable.

Contributions of Different Kinds of Hints. Figure 9 shows the results for three settings: using only the control flow hints; only the data flow hints (e.g., def-use and register-spilling); and using all hints. The x axis denotes intervals of the FP rate and the y axis represents the number of binaries that fall into an interval. For example, with only control flow hints, about 300 binaries have less than 1% FPs; with only data flow hints, about 70 binaries have less than 1% FPs; with all hints, the number is 510. In other words, both types of hints are critical for getting the best results.

Effects of Different Probability Thresholds. As mentioned

[Figure 9 bar chart: number of binaries (y axis) per False Positive Rate (%) interval (x axis), for control flow hints only, data flow hints only, and all hints.]

Fig. 9: Distribution of Different Kinds of Hints

Fig. 10: Tradeoffs of Threshold Setting

earlier, we retain instructions whose computed probability P >= α. Figure 10 shows how the FP and FN rates (on the right y axis) and the percentage of precisely disassembled functions (on the left y axis) change with α (the x axis). For example, at the starting point on the left, α = 0.67% (i.e., we keep instructions with P >= 0.0067), FP is about 4%, FN is 0, and 53.23% of the 607,758 functions in the corpora are precisely disassembled. As α grows, FP drops while FN and the rate of precisely disassembled functions rise. At the other end on the right, where α = 20%, FP is 0.6% whereas FN is 6.7%, and almost 73% of functions are precise.

B. RQ2: Comparison with Superset Disassembly

Linear sweep and traversal disassemblers suffer from false negatives, which may cause serious problems in binary rewriting. The superset disassembler [30] is a state-of-the-art technique that does not have false negatives. However, it introduces lots of false positives, leading to size blowup in rewriting and unnecessary runtime overhead. Table II shows the comparison with superset disassembly. To compare the effects on binary rewriting, we integrate our disassembler with their rewriter. We use the same SPEC programs as in [30] (column one). Columns 2-4 present the FP rate, the code size blowup after rewriting, and the execution time variation after rewriting, respectively. Here, we do not add any instructions during rewriting. Columns 5-7 present the same information for our technique. Observe that we reduce the size blowup from 763% to 404% and improve the execution time by 3%. Note that it is normal that rewritten binaries may execute faster than the original code [30] due to the cache behavior changes caused by rewriting. Note that although our technique still has 404% size inflation, this is because the rewriter uses a huge lookup table to translate each address in the code space. While all the entries are necessary in superset rewriting, the majority of these entries are not needed in our rewriter and are therefore empty. We plan to remove the empty table entries and replace the table with a cost-effective hash table in the future. The FP rate differences (columns 2 and 5) indicate the large number of these redundant entries.

C. RQ3: Handling Data and Code Interleavings

A prominent challenge in disassembly is to handle data and code interleavings (i.e., the presence of read-only data in between code segments), which could cause false negatives


Page 10: Probabilistic Disassembly - GitHub Pages

Program        | Superset FP | Superset Size | Superset Exec. time | Prob. FP | Prob. Size | Prob. Exec. time
400.perlbench  | 85.32%      | 780%          | 116.71%             | 11.29%   | 427%       | 117.74%
401.bzip2      | 84.65%      | 779%          | 105.49%             |  6.57%   | 400%       |  97.30%
403.gcc        | 88.03%      | 751%          | 104.60%             | 11.33%   | 409%       | 101.71%
429.mcf        | 84.72%      | 749%          | 104.02%             |  4.60%   | 399%       | 104.74%
445.gobmk      | 90.27%      | 727%          | 103.43%             |  6.20%   | 372%       |  97.30%
456.hmmer      | 82.71%      | 779%          |  99.14%             |  6.64%   | 411%       |  94.12%
458.sjeng      | 87.08%      | 756%          |  98.83%             |  7.61%   | 407%       |  92.76%
462.libquantum | 80.96%      | 758%          | 100.42%             |  4.04%   | 400%       |  96.94%
464.h264ref    | 82.36%      | 781%          | 100.39%             |  2.41%   | 395%       |  94.57%
471.omnetpp    | 85.02%      | 768%          | 105.24%             |  9.82%   | 420%       | 108.4%
473.astar      | 81.46%      | 761%          |  94.28%             |  3.90%   | 402%       |  93.24%
Avg            | 84.8%       | 763%          | 103.0%              |  6.8%    | 404%       |  99.9%

(Size and Exec. time are the ratio of the rewritten binary to the original.)

TABLE II: Superset Disassembly vs Probabilistic Disassembly

[Figure 11 bar chart: Number of Binaries (y axis) per FP Rate Interval (x axis).]

Fig. 11: FP Rates In the Absence of Function Entries

in linear sweep disassembly. In this experiment, we compile the SPECint 2000 benchmarks with Visual Studio 2017 at different optimization levels to generate a set of binaries. We extract the ground truth from pdb files. We use both objdump, a linear sweep disassembler, and our disassembler to disassemble the stripped binaries. The comparison between the disassembled results and the ground truth shows that objdump misses 3095 instructions in total, whereas our tool misses none. The average FP rate of our tool is 8.12% (5.95%, 8.84%, 5.76%, and 8.98% for optimization levels O1, O2, Od, and Ox, respectively). The FP rate is higher than for ELF binaries as data and code interleavings are more common in PE binaries.

D. RQ4: Handling Missing Function Entries

Another prominent challenge, especially for traversal disassembly, is missing function entries due to indirect calls. To simulate such challenges, we eliminate all the function-related hints, such as call edges that have the same target (part of hint I). In other words, we only leverage the intra-procedural hints to disassemble. Figure 11 presents the results, with the x axis the FP interval and the y axis the number of binaries. The average FP rate is 6.8%, slightly higher than that of using both inter- and intra-procedural hints. FN is still 0. This indicates that in the cases where traversal disassemblers such as IDA and BAP have trouble due to missing function entries, our technique has substantial advantages.

VI. RELATED WORK

We have discussed existing disassembly techniques in §II. In this section, we discuss other related work. Probabilistic inference has been used in program analysis, such as locating software faults [40], inferring explicit information flow [41], and recognizing memory objects [42]. To the best of our knowledge, however, we are the first to use it in binary disassembly.

Machine learning has been used for binary analysis. For instance, Wartell et al. [43] used a statistical compression technique to differentiate code and data. Shingled Graph Disassembly [44] leverages graph-model-based learning on a large corpus of binaries to recognize data bytes. Our technique does not require training. Its formalization of using a random variable to represent each address, the introduction of hints, and the fusion of these hints are unique. Dynamic disassembly (e.g., [45], [46], [27], [1], [47]) disassembles during execution. These approaches impose extra runtime overhead. In addition, they can hardly serve downstream static analyses such as dependence analysis. Disassembly has many applications, such as binary hardening [6], [48], [49], [50], [5], deobfuscation [51], [52], reassembleable disassembling [53], [54], [55], reverse engineering [56], and exploitation [57]. Our work is particularly suited for rewriting and hardening.

VII. THREATS TO VALIDITY

Although we used the corpus from BAP and SPEC in our experiments, the benchmarks may not represent all features of real-world binaries. We will test our technique on more binaries. We focus on binaries generated by compilers. It is unclear how our technique will perform on obfuscated code, although we believe semantic hints still exist in such code.

VIII. CONCLUSION

We propose a novel probabilistic disassembly technique that properly models the uncertainty in binary analysis. It computes a probability for each address in the code space, indicating the likelihood of the address representing a true positive instruction. The probability is computed by fusing a set of uncertain features that can reach the address. The results show that our technique produces no false negatives and as low as 3.7% false positives, and that it substantially outperforms a state-of-the-art superset disassembly technique.

ACKNOWLEDGMENT

The authors were supported in part by ONR N000141410468, N000141712947, and N000141712995, DARPA FA8650-15-C-7562, NSF 1748764, 1409668, 1834213, and 1834215, AFOSR FA95501410119, and Sandia National Lab under award 1701331.

REFERENCES

[1] D. L. Bruening, Efficient, transparent, and comprehensive runtime code manipulation. PhD thesis, Massachusetts Institute of Technology, 2004.

[2] M. Smithson, K. Anand, A. Kotha, K. Elwazeer, N. Giles, and R. Barua, “Binary rewriting without relocation information,” tech. rep., University of Maryland, 2010.

[3] M. Prasad and T.-c. Chiueh, A Binary Rewriting Defense against Stack-based Buffer Overflow Attacks. SUNY Stony Brook, 2003.

[4] M. Abadi, M. Budiu, U. Erlingsson, and J. Ligatti, Control-flow integrity: principles, implementations, and applications. University of California, Santa Cruz; Microsoft Research.

[5] L. Davi, A. Dmitrienko, M. Egele, T. Fischer, T. Holz, R. Hund, S. Nurnberger, and A.-R. Sadeghi, MoCFI: A framework to mitigate control-flow attacks on smartphones. University of California, Santa Barbara, 2012.

[6] R. Wartell, V. Mohan, K. Hamlen, and Z. Lin, “Securing untrusted code via compiler-agnostic binary rewriting,” in Proceedings of the 28th Annual Computer Security Applications Conference (ACSAC’12), (Orlando, FL), December 2012.

[7] X. Chen, A. Slowinska, D. Andriesse, H. Bos, and C. Giuffrida, “StackArmor: Comprehensive protection from stack-based memory error vulnerabilities for binaries,” in NDSS, 2015.

[8] V. van der Veen, D. Andriesse, E. Goktas, B. Gras, L. Sambuc, A. Slowinska, H. Bos, and C. Giuffrida, “Practical context-sensitive CFI,” in Proceedings of the 22nd ACM Conference on Computer and Communications Security (CCS), (Denver, Colorado), ACM, October 2015.

[9] V. van der Veen, E. Goktas, M. Contag, A. Pawlowski, X. Chen, S. Rawat, H. Bos, T. Holz, E. Athanasopoulos, and C. Giuffrida, “A tough call: Mitigating advanced code-reuse attacks at the binary level,” in Proceedings of the 37th IEEE Symposium on Security and Privacy (Oakland), (San Jose, CA, USA), IEEE, May 2016.

[10] X. Chen, H. Bos, and C. Giuffrida, “CodeArmor: Virtualizing the code space to counter disclosure attacks,” in EuroS&P, Apr.

[11] L. Li and C. Wang, “Dynamic analysis and debugging of binary code for security applications,” in RV’13.

[12] A. Sæbjørnsen, J. Willcock, T. Panas, D. Quinlan, and Z. Su, “Detecting code clones in binary executables,” in Proceedings of the Eighteenth International Symposium on Software Testing and Analysis (ISSTA’09).

[13] F. Peng, Z. Deng, X. Zhang, D. Xu, Z. Lin, and Z. Su, “X-Force: Force-executing binary programs for security applications,” in USENIX Security Symposium, 2014.

[14] G. Wang, S. Chattopadhyay, I. Gotovchits, T. Mitra, and A. Roychoudhury, “oo7: Low-overhead defense against Spectre attacks via binary analysis,” in CoRR abs/1807.05843, 2018.

[15] M. Böhme, V.-T. Pham, and A. Roychoudhury, “Coverage-based greybox fuzzing as Markov chain,” in ACM Conference on Computer and Communications Security (CCS’16).

[16] V.-T. Pham, M. Böhme, and A. Roychoudhury, “Model-based whitebox fuzzing for program binaries,” in ASE’16.

[17] V.-T. Pham, W. B. Ng, K. Rubinov, and A. Roychoudhury, “Hercules: Reproducing crashes in real-world application binaries,” in ICSE’15.

[18] D. Gopan, E. Driscoll, D. Nguyen, D. Naydich, A. Loginov, and D. Melski, “Data-delineation in software binaries and its application to buffer-overrun discovery,” in Proceedings of the 37th International Conference on Software Engineering (ICSE’15).

[19] Z. Xu, B. Chen, M. Chandramohan, Y. Liu, and F. Song, “Spain: Security patch analysis for binaries towards understanding the pain and pills,” in ICSE’17.

[20] E. Tilevich and Y. Smaragdakis, “Binary refactoring: Improving code behind the scenes,” in Proceedings of the 27th International Conference on Software Engineering (ICSE’05).

[21] N. Rosenblum, B. P. Miller, and X. Zhu, “Recovering the toolchain provenance of binary code,” in Proceedings of the 2011 International Symposium on Software Testing and Analysis (ISSTA’11).

[22] I. Gotovchits, D. Brumley, J. Bosamiya, and R. V. Tonder, “Binary analysis platform.”

[23] Hex-Rays, The Interactive Disassembler.

[24] O. Yuschuk, “OllyDbg,” http://www.ollydbg.de/, 2007.

[25] J. Kinder and H. Veith, “Jakstab: A static analysis platform for binaries,” in International Conference on Computer Aided Verification, pp. 423–427, Springer, 2008.

[26] M. Smithson, K. Anand, A. Kotha, K. Elwazeer, N. Giles, and R. Barua, SecondWrite: Binary Rewriting without Relocation Information. University of Maryland, 2010.

[27] A. R. Bernat and B. P. Miller, “Anywhere, any-time binary instrumentation,” in Proceedings of the 10th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools, pp. 9–16, ACM, 2011.

[28] X. Meng and B. P. Miller, “Binary code is not easy,” in Proceedings of the 25th International Symposium on Software Testing and Analysis, pp. 24–35, ACM, 2016.

[29] T. Bao, J. Burket, M. Woo, R. Turner, and D. Brumley, BYTEWEIGHT: Learning to Recognize Functions in Binary Code. Carnegie Mellon University; University of Chicago, 2014.

[30] E. Bauman, Z. Lin, and K. Hamlen, “Superset disassembly: Statically rewriting x86 binaries without heuristics,” in Proceedings of the 25th Annual Network and Distributed System Security Symposium (NDSS’18), (San Diego, CA), February 2018.

[31] D. Andriesse, X. Chen, V. van der Veen, A. Slowinska, and H. Bos, “An in-depth analysis of disassembly on full-scale x86/x64 binaries,” in USENIX Security Symposium, pp. 583–600, 2016.

[32] N. E. Rosenblum, X. Zhu, B. P. Miller, and K. Hunt, “Learning to analyze binary computer code,” in AAAI, pp. 798–804, 2008.

[33] E. C. R. Shin, D. Song, and R. Moazzezi, “Recognizing functions in binaries with neural networks,” in USENIX Security Symposium, pp. 611–626, 2015.

[34] D. Andriesse, A. Slowinska, and H. Bos, “Compiler-agnostic function detection in binaries,” in Security and Privacy (EuroS&P), 2017 IEEE European Symposium on, pp. 177–189, IEEE, 2017.

[35] R. Qiao and R. Sekar, “Function interface analysis: A principled approach for function recognition in COTS binaries,” in Dependable Systems and Networks (DSN), 2017 47th Annual IEEE/IFIP International Conference on, pp. 201–212, IEEE, 2017.

[36] R. Qiao, Accurate Recovery of Functions in COTS Binaries. PhD thesis, State University of New York at Stony Brook, 2017.

[37] C. Linn and S. Debray, “Obfuscation of executable code to improve resistance to static disassembly,” in Proceedings of the 10th ACM Conference on Computer and Communications Security (CCS’03), 2003.

[38] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools (2nd Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2006.

[39] Intel Corporation, “Intel® 64 and IA-32 Architectures Software Developer’s Manual,” Volume 3B: System Programming Guide, Part 2, 2011.

[40] G. K. Baah, A. Podgurski, and M. J. Harrold, “The probabilistic program dependence graph and its application to fault diagnosis,” in Proceedings of the 2008 International Symposium on Software Testing and Analysis, ISSTA ’08, (Seattle, WA, USA), pp. 189–200, ACM, 2008.

[41] B. Livshits, A. V. Nori, S. K. Rajamani, and A. Banerjee, “Merlin: Specification inference for explicit information flow problems,” in Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’09, (Dublin, Ireland), pp. 75–86, ACM, 2009.

[42] Z. Lin, J. Rhee, C. Wu, X. Zhang, and D. Xu, “DIMSUM: Discovering semantic data of interest from un-mappable memory with confidence,” in Proceedings of the 19th Annual Network and Distributed System Security Symposium (NDSS’12), (San Diego, CA), February 2012.

[43] R. Wartell, Y. Zhou, K. W. Hamlen, M. Kantarcioglu, and B. Thuraisingham, “Differentiating code from data in x86 binaries,” in Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) (D. Gunopulos, T. Hofmann, D. Malerba, and M. Vazirgiannis, eds.), vol. 3, (Athens, Greece), pp. 522–536, September 2011.

[44] R. Wartell, Y. Zhou, K. W. Hamlen, and M. Kantarcioglu, “Shingled graph disassembly: Finding the undecodeable path,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 273–285, Springer, 2014.

[45] D. Bruening, T. Garnett, and S. Amarasinghe, “An infrastructure for adaptive dynamic optimization,” in Code Generation and Optimization, 2003. CGO 2003. International Symposium on, pp. 265–275, IEEE, 2003.

[46] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi, “Pinpointing representative portions of large Intel Itanium programs with dynamic instrumentation,” in Proc. 37th IEEE/ACM International Sym. on Microarchitecture, pp. 81–92, 2004.

[47] S. Nanda, W. Li, and L.-C. Lam, BIRD: Binary Interpretation using Runtime Disassembly. SUNY, Stony Brook, 2006.

[48] M. Smithson, K. ElWazeer, K. Anand, A. Kotha, and R. Barua, “Static binary rewriting without supplemental information: Overcoming the tradeoff between coverage and correctness,” in 2013 20th Working Conference on Reverse Engineering (WCRE), pp. 52–61, IEEE, 2013.

[49] M. Wang, H. Yin, A. V. Bhaskar, P. Su, and D. Feng, “Binary code continent: Finer-grained control flow integrity for stripped binaries,” in Proceedings of the 31st Annual Computer Security Applications Conference, pp. 331–340, ACM, 2015.

[50] P. O’Sullivan, K. Anand, A. Kotha, M. Smithson, R. Barua, and A. D. Keromytis, Retrofitting Security in COTS Software with Binary Rewriting. University of Maryland; Columbia University, 2011.

[51] C. Kruegel, W. Robertson, F. Valeur, and G. Vigna, Static Disassembly of Obfuscated Binaries. Reliable Software Group; University of Santa Barbara, 2004.

[52] C. Kruegel, W. Robertson, F. Valeur, and G. Vigna, “Static disassembly of obfuscated binaries,” in USENIX Security Symposium, vol. 13, pp. 18–18, 2004.

[53] S. Wang, P. Wang, and D. Wu, “Reassembleable disassembling,” in USENIX Security Symposium, pp. 627–642, 2015.

[54] S. Wang, P. Wang, and D. Wu, “Uroboros: Instrumenting stripped binaries with static reassembling,” in Software Analysis, Evolution, and Reengineering (SANER), 2016 IEEE 23rd International Conference on, vol. 1, pp. 236–247, IEEE, 2016.

[55] R. Wang, Y. Shoshitaishvili, A. Bianchi, A. Machiry, J. Grosen, P. Grosen, C. Kruegel, and G. Vigna, “Ramblr: Making reassembly great again,” 2017.

[56] M. Sharif, V. Yegneswaran, H. Saidi, P. Porras, and W. Lee, “Eureka: A framework for enabling static malware analysis,” in European Symposium on Research in Computer Security, pp. 481–500, Springer, 2008.

[57] G. Hoglund and J. Butler, Rootkits: Subverting the Windows Kernel, ch. 4: The Age-Old Art of Hooking, pp. 73–74. Pearson Education, Inc., 2006.
