Wire - A Formal Intermediate Language for Binary Analysis

Silvio Cesare and Yang Xiang, School of Information Technology, Deakin University



Introduction - Motivation

• Static analysis has many benefits

• Applications include:
  – Bug detection
  – Plagiarism detection
  – Code optimisation

• Mostly source-level, but binary-level analysis offers additional benefits and applications:
  – Malware detection
  – Software theft detection
  – Bug detection of compiled and link-edited programs


Introduction - Challenges

• Binary analysis is hard.
  – Even separating code from data is undecidable.
  – Perfect disassembly of x86 is undecidable.

• Many challenges.
  – Native CISC architectures have hundreds of complex instructions.
  – Native instructions have side effects which require hidden assumptions in analysis.
  – Native architectures require separate implementations on each platform.


Innovation in our work

• Wire - a new formal intermediate language (IL).

• Translation of native assembly to our IL.

• Applications - semantic equivalence proofs of obfuscated assembly.

• Applications - Malwise, a malware classification system from our previous work, uses Wire as the IL.


Related Work

• A compiler’s intermediate representation
  – Three Address Code

• Dynamic Binary Instrumentation
  – QEMU
  – VEX (used in Valgrind)

• A decompiler’s intermediate representation
  – DCC
  – Boomerang
  – IDA Pro and HexRays

• Binary analysis
  – Vine (based on Vex), BIL (BitBlaze)
  – REIL

Example of three address code:

    i := 0
L1:
    if i >= 10 goto L2
    t0 := i*i
    t1 := &b
    t2 := t1 + i
    *t2 := t0
    i := i + 1
    goto L1
L2:


Translating Native Code (1)

• Load object file format
  – X86 ELF32, PE32
  – Some Java class file support

• Disassemble
  – Linear Sweep
  – Recursive Traversal
  – Speculative

• Translate each native instruction to n three address codes.

$\mathit{native} \rightarrow \{ (n, (\mathit{Opcode}, \mathit{Operand}_1, \mathit{Operand}_2, \mathit{Operand}_3)) \mid n \in \mathbb{N} \}$
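The mapping from one native instruction to a numbered list of three address codes can be sketched in Python. This is only an illustration: the `Tac` type, the `translate` and `translate_program` helpers and the AT&T-syntax parsing are hypothetical, and the "-" placeholder for an unused operand is an assumption. The opcode names and operand order are copied from the example listings later in this deck.

```python
from typing import List, NamedTuple, Tuple


class Tac(NamedTuple):
    """One three address code: an opcode and three operand slots."""
    opcode: str
    operand1: str
    operand2: str
    operand3: str


def translate(native: str) -> List[Tac]:
    """Translate one AT&T-syntax instruction into a list of three address codes.

    Hypothetical and deliberately tiny: only the instruction shapes that
    appear in this deck's examples are handled.
    """
    mnemonic, operands = native.split(None, 1)
    src, dst = [part.strip() for part in operands.split(",")]
    if mnemonic == "mov" and src.startswith("$"):
        return [Tac("ASSIGNC", src, "-", dst)]
    if mnemonic == "add" and src.startswith("$"):
        return [Tac("BOPCADD", dst, src, dst)]
    if mnemonic == "sub" and src.startswith("$"):
        return [Tac("BOPCSUB", dst, src, dst)]
    if mnemonic == "xor" and src.startswith("%"):
        # One native instruction can expand to several codes (n > 1):
        # the arithmetic result plus a condition code variable.
        return [
            Tac("BOPXOR", dst, src, dst),
            Tac("UMKBOOLIsZero", dst, "-", "%zf"),
        ]
    raise ValueError(f"unsupported instruction: {native!r}")


def translate_program(lines: List[str]) -> List[Tuple[int, Tac]]:
    """Number every emitted code, giving pairs (n, (Opcode, Operand1, Operand2, Operand3))."""
    codes = [tac for line in lines for tac in translate(line)]
    return list(enumerate(codes))
```

For instance, `translate_program(["mov $0,%eax", "xor %eax,%eax"])` yields three numbered codes from two native instructions.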


Translating Native Code (2)

• Map registers between IL’s abstract machine and native architecture.

• Assign labels to beginning of basic blocks.

• Assign results of arithmetic (etc.) instructions to condition code variables:
  – E.g. eq_cond = mkbool x == y

• Decompile parts of IL for additional information.

• Optimise IL code.
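The label assignment step in the list above can be sketched as follows. The deck does not say how block boundaries are found, so this uses the textbook "leader" rule as an assumption: the first instruction, every jump target and every instruction following a jump starts a basic block. The opcode names in `JUMPS`, the tuple layout and the `L0`, `L1`, ... label names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (opcode, jump target as an instruction index, or None for non-jumps)
Instr = Tuple[str, Optional[int]]

JUMPS = {"jmp", "cjmp"}  # hypothetical opcode names, not Wire's


def find_leaders(code: List[Instr]) -> List[int]:
    """Return the sorted indices of instructions that begin a basic block."""
    leaders = set()
    if code:
        leaders.add(0)  # the first instruction always starts a block
    for index, (opcode, target) in enumerate(code):
        if opcode in JUMPS:
            if target is not None:
                leaders.add(target)      # a jump target starts a block
            if index + 1 < len(code):
                leaders.add(index + 1)   # so does the instruction after a jump
    return sorted(leaders)


def assign_labels(code: List[Instr]) -> Dict[int, str]:
    """Map each leader index to a fresh label name."""
    return {leader: f"L{n}" for n, leader in enumerate(find_leaders(code))}
```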


Formal Syntax

Instructions  I ::= n → i      (maps instruction number to instruction)
Heap          H ::= n × n → n  (maps heap address and memory size to non-overlapping memory addresses)
Memory        M ::= n → n      (maps address to numeric value)
Register      R ::= r → n      (maps register name to numeric value)
Labels        L ::= l → pc     (maps label to instruction address pc)
AllocAMemory  V ::= n × n → n  (maps alloca address and memory size to non-overlapping memory addresses)

Program       p ::= p i | i
Instruction   i ::= m | m t
Type          t ::= u8_t | u16_t | u32_t | s8_t | s16_t | s32_t

Instructions  m ::= *(r3) := r1
                  | r3 := (*r1)
                  | r3 := r1
                  | r3 := n
                  | r3 := uop r1
                  | r3 := r1 bop r2
                  | r3 := r1 bop n
                  | mkbool r1 ucond
                  | mkbool r1 bcond r2
                  | nop
                  | halt
                  | label l
                  | jmp l
                  | ijmp r
                  | if r1 cond1 jmp l
                  | if r1 cond2 r2 jmp l
                  | lcall s
                  | cast(r1, t)
                  | r3 := getpc()
                  | r3 := returnaddress()
                  | pusharg(n, r)
                  | r3 := malloc(r)
                  | free(r)
                  | r3 := alloca(r)

Operations    uop ::= - | ~ | !
              bop ::= +, -, *, /, %, >>, <<, |, &, ^

Conditions    ucond ::= == 0 | != 0
              bcond ::= == | != | > | >= | < | <=

Operands      v ::= n (an integer literal)
                  | r (a register)
                  | l (a label)
                  | s (a symbol)


Formal Semantics

• Operational semantics define the state transitions that occur from execution of the program.

$$\frac{\mathit{premise}_1 \quad \cdots \quad \mathit{premise}_n}{(i, P) \Rightarrow P'}\ \text{NAME}$$

where i is the current instruction, P is the program state and P’ is the new program state.
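One way to read a rule of this shape is as a function from (i, P) to P’: the premises become checks and the conclusion builds the new state. The sketch below does this for a constant assignment (the grammar form r3 := n). It is an illustration only, not the rule from the paper: the tuple encoding, the "assign_const" name, the 32-bit range premise and keeping pc inside the same dictionary are all assumptions.

```python
from typing import Dict, Tuple

State = Dict[str, int]        # register name -> value, plus the key "pc"
Instr = Tuple[str, str, int]  # ("assign_const", destination register, literal)


def step(instr: Instr, state: State) -> State:
    """Apply one transition (i, P) => P' and return the new state P'."""
    opcode, dest, literal = instr

    # Premises: side conditions that must hold for the rule to apply.
    if opcode != "assign_const":
        raise ValueError(f"no rule for opcode {opcode!r}")
    if not 0 <= literal < 2 ** 32:
        raise ValueError("literal does not fit the assumed 32-bit register")

    # Conclusion: P' is P with the destination updated and pc advanced.
    new_state = dict(state)   # copy, so P itself is left unchanged
    new_state[dest] = literal
    new_state["pc"] = state["pc"] + 1
    return new_state
```

Returning a fresh dictionary rather than mutating the input keeps P and P’ both available, which is what the equivalence arguments later in the deck compare.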


Formal Semantics of Wire

• Control Flow Instructions
• Arithmetic Instructions
• Boolean Instructions
• Memory Access Instructions
• Casting Instructions
• Decompiled Instructions
  – Address Instructions
  – Memory Allocation Instructions
  – Procedural Instructions


Formal Semantics Examples

• See paper for full instruction semantics

The LOAD instruction implements a memory read.

The STORE instruction implements a memory write.
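A minimal sketch of the two memory access instructions, matching the grammar forms r3 := (*r1) and *(r3) := r1. The formal rules are in the paper and are not reproduced here, so the details are assumptions: memory is modelled as a word-granular dictionary, operand types are ignored, and reading an unmapped address is treated as an error.

```python
from typing import Dict

Registers = Dict[str, int]  # register name -> value
Memory = Dict[int, int]     # address -> value


def load(registers: Registers, memory: Memory, dest: str, addr_reg: str) -> Registers:
    """r3 := (*r1): read the value stored at the address held in addr_reg."""
    address = registers[addr_reg]
    if address not in memory:
        raise KeyError(f"read of unmapped address {address:#x}")
    new_registers = dict(registers)
    new_registers[dest] = memory[address]
    return new_registers


def store(registers: Registers, memory: Memory, addr_reg: str, src: str) -> Memory:
    """*(r3) := r1: write the value of src to the address held in addr_reg."""
    new_memory = dict(memory)
    new_memory[registers[addr_reg]] = registers[src]
    return new_memory
```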


Formal Semantics – Three Address Code


Applications

• A formal language leads to formal proofs.

• Equivalence proofs enable detection of obfuscated code in malware.

• We assume the translation from the native assembly architecture to the IL is correct.


Applications - Dead Code Insertion

• Dead code or junk code is a semantic nop (no operation).

• Inserted into malware to evade signature detection of code.

• The native assembly and Wire’s three address code are shown below:

Original code:

    native assembly        Wire’s IL
    mov $0,%eax            ASSIGNC $0,-,%eax

Code with dead code inserted:

    native assembly        Wire’s IL
    mov $0,%eax            ASSIGNC $0,,%eax
    add $50,%eax           BOPCADD %eax,$50,%eax
    sub $50,%eax           BOPCSUB %eax,$50,%eax


How the equivalence proofs work

• In the first part of the proofs, the original code is executed following the operational semantics of Wire.

• In the second part of the proofs, the obfuscated code is executed.

• The proofs are constructed by showing that the final states of the two parts are the same, given the same initial state.


Dead Code Insertion Proof

Reg_name(“eax”) = 0
Reg_name(“ebx”) = 1
Reg_name(“zf”) = 100

In the first part of the dead code equivalence proof we execute the instructions without the dead code.

In the second part of the proof we execute the instructions with the dead code.

Now we can see that t’’’-pc = s’-pc, which means the two sequences are semantically equivalent when ignoring the effect the code has on the program counter. We also note that s’ and s’’ are semantically equivalent. We have thus proven that the obfuscated and deobfuscated code samples are equivalent.
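The two-part argument above can be mimicked by a small interpreter: run each sequence from the same initial state, drop pc from both final states and compare what remains. This is a sketch under assumptions, not the proof itself: it checks concrete initial states rather than reasoning symbolically, registers are assumed to be 32 bits wide, and the `run` and `equivalent_ignoring_pc` helpers and the tuple encoding are hypothetical. Only the three opcodes from the dead code example are handled.

```python
from typing import Dict, List, Tuple

Tac = Tuple[str, str, str, str]  # (opcode, operand1, operand2, operand3)
State = Dict[str, int]

MASK = 0xFFFFFFFF  # assumption: 32-bit registers with wrap-around arithmetic


def value(operand: str, state: State) -> int:
    """Evaluate a '$' literal or look up a '%' register."""
    if operand.startswith("$"):
        return int(operand[1:], 0) & MASK
    return state[operand]


def run(code: List[Tac], initial: State) -> State:
    """Execute straight-line code and return the final state, including pc."""
    state = dict(initial)
    state["pc"] = 0
    for opcode, op1, op2, dest in code:
        if opcode == "ASSIGNC":
            state[dest] = value(op1, state)
        elif opcode == "BOPCADD":
            state[dest] = (value(op1, state) + value(op2, state)) & MASK
        elif opcode == "BOPCSUB":
            state[dest] = (value(op1, state) - value(op2, state)) & MASK
        else:
            raise ValueError(f"unsupported opcode {opcode!r}")
        state["pc"] += 1
    return state


def equivalent_ignoring_pc(code_a: List[Tac], code_b: List[Tac], initial: State) -> bool:
    """Compare the two final states with the program counter removed."""
    final_a = run(code_a, initial)
    final_b = run(code_b, initial)
    del final_a["pc"], final_b["pc"]
    return final_a == final_b


original = [("ASSIGNC", "$0", "-", "%eax")]
obfuscated = [
    ("ASSIGNC", "$0", "-", "%eax"),
    ("BOPCADD", "%eax", "$50", "%eax"),
    ("BOPCSUB", "%eax", "$50", "%eax"),
]
```

The final pc values differ (1 versus 3 in this encoding), which is why the comparison has to exclude the program counter.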


Applications – Code Reordering

• Code reordering changes the order of instructions while maintaining semantic equivalence.

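First instruction sequence:

    native assembly        Wire’s IL
    mov $2,%eax            ASSIGNC $0x2,,%eax
    mov $1,%ebx            ASSIGNC $1,,%ebx
    add %eax,%ebx          BOPADD %ebx,%eax,%ebx

Second instruction sequence:

    native assembly        Wire’s IL
    mov $1,%ebx            ASSIGNC $0x1,-,%ebx
    mov $2,%eax            ASSIGNC $2,-,%eax
    add %eax,%ebx          BOPADD %ebx,%eax,%ebx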


Code Reordering Proof

For the first part of the proof we execute the first instruction sequence.

For the second part of the proof we execute the second instruction sequence.

Thus we see that t’’’-pc = s’’’-pc and therefore the two instruction sequences are semantically equivalent.
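An informal way to see why this particular reordering is harmless is to look at which registers each three address code reads and writes. The sketch below uses a standard independence check (Bernstein-style conditions), which is a swapped-in technique and not the proof method of the deck. It assumes the operand order (source, source, destination) seen in the listings, and it ignores memory and the program counter.

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple

Tac = Tuple[str, str, str, str]  # (opcode, operand1, operand2, operand3)


class Effect(NamedTuple):
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def registers_in(operands: Iterable[str]) -> FrozenSet[str]:
    """Keep only register operands, which start with '%'."""
    return frozenset(o for o in operands if o.startswith("%"))


def effect(tac: Tac) -> Effect:
    """Registers read (first two operands) and written (third operand)."""
    _, op1, op2, op3 = tac
    return Effect(reads=registers_in((op1, op2)), writes=registers_in((op3,)))


def can_swap(a: Tac, b: Tac) -> bool:
    """True when neither code writes a register the other reads or writes."""
    ea, eb = effect(a), effect(b)
    if ea.writes & (eb.reads | eb.writes):
        return False
    if eb.writes & ea.reads:
        return False
    return True
```

The two ASSIGNC codes write different registers and read none, so they can be swapped; the BOPADD reads both registers, so it cannot move ahead of either assignment.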


Applications – Opaque Predicate Insertion

• An opaque predicate is a predicate that always evaluates to the same value, but this value is hard to determine statically.

First code sequence:

    native assembly        Wire’s IL
    xor %eax,%eax          BOPXOR %eax,%eax,%eax
                           UMKBOOLIsZero %eax,,%zf
    mov $2,%eax            ASSIGNC $2,-,%eax

Second code sequence (with the opaque predicate):

    native assembly        Wire’s IL
    xor %eax,%eax          BOPXOR %eax,%eax,%eax
                           UMKBOOLIsZero %eax,,%zf
    jnz $0x80482000        UCJMPIsNotZero %zf,,$target
    mov $2,%eax            ASSIGNC $2,-,%eax


Opaque Predicate Insertion Proof

In the first part of the proof we execute the first code sequence.

 

In the second part of the proof we execute the second code sequence.

We see that register 100 (zf) is set, so the condition of the conditional branch in the following instruction is false and the branch is not taken.

Thus we see that s’’-pc = t’’’’-pc, which proves semantic equivalence.
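The same comparison can be tried on concrete inputs with a tiny interpreter that supports a conditional branch. This sketch models the native x86 behaviour rather than Wire's opcodes: the tuple names are hypothetical, the branch follows the x86 `jnz` rule (taken only when zf is 0), and the bogus branch target is modelled as an extra instruction at index 5 that writes a marker value.

```python
from typing import Dict, List, Tuple

Registers = Dict[str, int]


def run(code: List[tuple], registers: Registers) -> Tuple[Registers, List[int]]:
    """Execute until 'halt'; return final registers and the visited indices."""
    regs = dict(registers)
    visited: List[int] = []
    pc = 0
    while True:
        visited.append(pc)
        instr = code[pc]
        op = instr[0]
        if op == "halt":
            return regs, visited
        if op == "xor":
            regs[instr[1]] = regs[instr[1]] ^ regs[instr[2]]
        elif op == "set_zf_if_zero":
            regs["zf"] = 1 if regs[instr[1]] == 0 else 0
        elif op == "mov_const":
            regs[instr[1]] = instr[2]
        elif op == "jump_if_zf_clear":
            if regs["zf"] == 0:  # x86 jnz: branch only when zf is 0
                pc = instr[1]
                continue
        else:
            raise ValueError(f"unsupported op {op!r}")
        pc += 1


original = [
    ("xor", "eax", "eax"),
    ("set_zf_if_zero", "eax"),
    ("mov_const", "eax", 2),
    ("halt",),
]
obfuscated = [
    ("xor", "eax", "eax"),       # 0: eax becomes 0 for every input
    ("set_zf_if_zero", "eax"),   # 1: so zf is always set
    ("jump_if_zf_clear", 5),     # 2: opaque predicate, never taken
    ("mov_const", "eax", 2),     # 3
    ("halt",),                   # 4
    ("mov_const", "eax", 99),    # 5: bogus target, unreachable
    ("halt",),                   # 6
]
```

Running both sequences from any initial value of eax gives the same final registers, and index 5 never appears in the visited list for the obfuscated sequence.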


Conclusion

• Wire is a new formal intermediate language.

• Formally defined semantics allow for formal reasoning.

• Wire has demonstrated applications in binary analysis.