Wire - A Formal Intermediate Language for Binary Analysis


Silvio Cesare and Yang Xiang
School of Information Technology
Deakin University

Introduction - Motivation

• Static analysis has many benefits

• Applications include:
– Bug detection
– Plagiarism detection
– Code optimisation

• Mostly source-level, but binary-level analysis offers additional benefits and applications:
– Malware detection
– Software theft detection
– Bug detection of compiled and link-edited programs

Introduction - Challenges

• Binary analysis is hard.
– Even separating code from data is undecidable.
– Perfect disassembly of x86 is undecidable.

• Many challenges.
– Native CISC architectures have hundreds of complex instructions.
– Native instructions have side effects which require hidden assumptions in analysis.
– Native architectures require separate implementations on each platform.

Innovation in our work

• Wire - a new formal intermediate language (IL).

• Translation of native assembly to our IL.

• Applications - semantic equivalence proofs of obfuscated assembly.

• Applications - Malwise, a malware classification system from our previous work, uses Wire as the IL.

Related Work

• A compiler’s intermediate representation
– Three Address Code

• Dynamic Binary Instrumentation
– QEMU
– VEX (used in Valgrind)

• A decompiler’s intermediate representation
– DCC
– Boomerang
– IDA Pro and HexRays

• Binary analysis
– Vine (based on VEX), BIL (BitBlaze)
– REIL

Example of three address code:

i := 0
L1:
if i >= 10 goto L2
t0 := i*i
t1 := &b
t2 := t1 + i
*t2 := t0
i := i + 1
goto L1
L2:
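To make the three address code above concrete, here is a minimal Python sketch that encodes it as tuples and executes it. The tuple encoding, the opcode names (`const`, `mul`, `add`, `store`, `if_ge_goto`, `goto`) and the placement of array `b` at address 0 are all assumptions made for illustration, and the sketch reads the capital `I` in the extracted text as the loop variable `i`.

```python
# Hypothetical mini-interpreter for the three address code above.
# The instruction encoding is illustrative only, not Wire's actual format.

def run_tac(program, labels, memory_size=16):
    env = {}                       # variable name -> integer value
    mem = [0] * memory_size        # flat memory; array b is assumed at address 0
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "const":          # dst := constant
            env[args[0]] = args[1]
        elif op == "mul":          # dst := a * b
            env[args[0]] = env[args[1]] * env[args[2]]
        elif op == "add":          # dst := a + b (b may be a literal int)
            rhs = args[2] if isinstance(args[2], int) else env[args[2]]
            env[args[0]] = env[args[1]] + rhs
        elif op == "store":        # *addr := src
            mem[env[args[0]]] = env[args[1]]
        elif op == "if_ge_goto":   # if a >= constant goto label
            if env[args[0]] >= args[1]:
                pc = labels[args[2]]
        elif op == "goto":
            pc = labels[args[0]]
    return env, mem

PROGRAM = [
    ("const", "i", 0),              # i := 0
    ("if_ge_goto", "i", 10, "L2"),  # L1: if i >= 10 goto L2
    ("mul", "t0", "i", "i"),        # t0 := i*i
    ("const", "t1", 0),             # t1 := &b (b assumed at address 0)
    ("add", "t2", "t1", "i"),       # t2 := t1 + i
    ("store", "t2", "t0"),          # *t2 := t0
    ("add", "i", "i", 1),           # i := i + 1
    ("goto", "L1"),                 # goto L1
]
LABELS = {"L1": 1, "L2": 8}         # label -> instruction index

env, mem = run_tac(PROGRAM, LABELS)
```

Under these assumptions the loop fills the first ten cells of `mem` with the squares of 0 to 9, which is what the source-level loop `b[i] = i*i` would do.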

Translating Native Code (1)

• Load object file format
– x86 ELF32, PE32
– Some Java class file support.

• Disassemble
– Linear Sweep
– Recursive Traversal
– Speculative
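The difference between linear sweep and recursive traversal can be sketched on a toy instruction set. Everything in the block below (the opcode bytes, the instruction lengths, the absolute branch targets) is invented for illustration and has nothing to do with real x86 encoding.

```python
# Toy instruction set (hypothetical): opcode byte -> (mnemonic, length in bytes)
ISA = {0x01: ("nop", 1), 0x02: ("jmp", 2), 0x03: ("ret", 1), 0x04: ("jz", 2)}

def decode(code, addr):
    """Return (mnemonic, length, operand), or None if nothing decodes here."""
    if addr >= len(code) or code[addr] not in ISA:
        return None
    name, length = ISA[code[addr]]
    if addr + length > len(code):
        return None
    operand = code[addr + 1] if length == 2 else None
    return name, length, operand

def linear_sweep(code):
    """Decode from address 0 onward, skipping one byte on a decode failure."""
    found, addr = {}, 0
    while addr < len(code):
        insn = decode(code, addr)
        if insn is None:
            addr += 1
            continue
        found[addr] = insn
        addr += insn[1]
    return found

def recursive_traversal(code, entry=0):
    """Follow control flow from the entry point, decoding only reachable code."""
    found, worklist = {}, [entry]
    while worklist:
        addr = worklist.pop()
        if addr in found:
            continue
        insn = decode(code, addr)
        if insn is None:
            continue
        found[addr] = insn
        name, length, operand = insn
        if name in ("jmp", "jz"):
            worklist.append(operand)        # branch target (absolute address)
        if name not in ("jmp", "ret"):
            worklist.append(addr + length)  # fall-through successor
    return found

# jmp 4; two data bytes (the first happens to look like a nop); nop; ret
CODE = bytes([0x02, 0x04, 0x01, 0xFF, 0x01, 0x03])
```

On this input the linear sweep reports an instruction at address 2, which is really data, while the recursive traversal skips it because no control flow reaches it. The trade-off runs the other way for code reached only through indirect branches, which a recursive traversal of this kind would miss.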

• Translate each native instruction to n three address codes.

$\mathit{native} \rightarrow \{ (n, (\mathit{Opcode}, \mathit{Operand}_1, \mathit{Operand}_2, \mathit{Operand}_3)) \mid n \in \mathbb{N} \}$
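As a rough illustration of this mapping, the sketch below translates a handful of AT&T-style instructions into numbered (Opcode, Operand1, Operand2, Operand3) tuples. The opcode names are the ones that appear in the examples later in these slides; the function `translate`, its argument order and the choice of `None` for an unused operand are assumptions, not the authors' translator.

```python
def translate(mnemonic, src, dst):
    """Translate one two-operand instruction into a list of (n, tuple) pairs.

    Only the cases shown on the slides are handled; this is a sketch.
    """
    is_const = src.startswith("$")
    if mnemonic == "mov" and is_const:
        ops = [("ASSIGNC", src, None, dst)]
    elif mnemonic == "add":
        ops = [("BOPCADD" if is_const else "BOPADD", dst, src, dst)]
    elif mnemonic == "sub" and is_const:
        ops = [("BOPCSUB", dst, src, dst)]
    elif mnemonic == "xor":
        # One native instruction can expand to several IL instructions:
        # the result, then the condition code derived from it.
        ops = [("BOPXOR", dst, src, dst),
               ("UMKBOOLIsZero", dst, None, "%zf")]
    else:
        raise ValueError(f"unsupported instruction: {mnemonic} {src},{dst}")
    return list(enumerate(ops))

print(translate("xor", "%eax", "%eax"))
```

The `xor` case shows why the mapping produces n tuples rather than one: the side effect on the zero flag becomes an explicit IL instruction of its own.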

Translating Native Code (2)

• Map registers between IL’s abstract machine and native architecture.

• Assign labels to beginning of basic blocks.

• Assign results of arithmetic (etc.) instructions to condition code variables:
– E.g. eq_cond = mkbool x == y

• Decompile parts of IL for additional information.

• Optimise IL code.
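The label assignment step can be sketched with the usual notion of basic-block leaders: the first instruction, every branch target, and every instruction that follows a branch. The opcode names `JMP`, `CJMP` and `ASSIGN` and the `L0`, `L1`, ... label scheme below are placeholders chosen for this example rather than Wire's own.

```python
def assign_labels(instructions):
    """Return {index: label} for every basic-block leader.

    `instructions` is a list of (opcode, target) pairs, where `target` is the
    index of the branch destination, or None for non-branch instructions.
    """
    leaders = {0} if instructions else set()
    for index, (opcode, target) in enumerate(instructions):
        if opcode in ("JMP", "CJMP"):
            leaders.add(target)                 # a branch target starts a block
            if index + 1 < len(instructions):
                leaders.add(index + 1)          # so does the instruction after a branch
    return {index: f"L{n}" for n, index in enumerate(sorted(leaders))}

EXAMPLE = [
    ("ASSIGN", None),   # 0
    ("CJMP", 3),        # 1: conditional branch to 3
    ("ASSIGN", None),   # 2
    ("ASSIGN", None),   # 3
    ("JMP", 1),         # 4: back edge to 1
]
labels = assign_labels(EXAMPLE)
```

Once every leader carries a label, branch operands can refer to labels instead of raw native addresses.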

Formal Semantics

• Operational semantics define the state transitions that occur from execution of the program.

$$\frac{\mathit{premise}_1 \quad \dots \quad \mathit{premise}_n}{(i, P) \Rightarrow P'} \ \text{NAME}$$

where i is the current instruction, P is the program state and P’ is the new program state.

Formal Semantics of Wire

• Control Flow Instructions
• Arithmetic Instructions
• Boolean Instructions
• Memory Access Instructions
• Casting Instructions
• Decompiled Instructions
– Address Instructions
– Memory Allocation Instructions
– Procedural Instructions

Formal Semantics Examples

• See paper for full instruction semantics

The LOAD instruction implements a memory read.

The STORE instruction implements a memory write.
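The shape of such rules, a transition from (i, P) to a new state P’, can be sketched as pure functions on an immutable state. This is only an illustration: the operand order, the `State` fields and the choice to read uninitialised memory as 0 are assumptions, and the paper's rules remain the authoritative definition.

```python
from typing import Mapping, NamedTuple

class State(NamedTuple):
    pc: int
    regs: Mapping[int, int]     # register number -> value
    mem: Mapping[int, int]      # address -> value

def step_load(state, addr_reg, dst_reg):
    """(LOAD, P) => P': read memory at the address held in addr_reg."""
    value = state.mem.get(state.regs[addr_reg], 0)  # unset memory reads as 0 (assumption)
    return State(state.pc + 1, {**state.regs, dst_reg: value}, state.mem)

def step_store(state, src_reg, addr_reg):
    """(STORE, P) => P': write src_reg to memory at the address in addr_reg."""
    address = state.regs[addr_reg]
    return State(state.pc + 1, state.regs, {**state.mem, address: state.regs[src_reg]})

p0 = State(pc=0, regs={0: 7, 1: 0x1000}, mem={})
p1 = step_store(p0, src_reg=0, addr_reg=1)   # memory[0x1000] := 7
p2 = step_load(p1, addr_reg=1, dst_reg=2)    # register 2 := memory[0x1000]
```

Each step returns a fresh state and leaves the old one untouched, which mirrors how a rule relates P to P’ rather than mutating P.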

Formal Semantics – Three Address Code

Applications

• A formal language leads to formal proofs.

• Equivalence proofs enable detection of obfuscated code in malware.

• We assume the translation from the native assembly architecture to the IL is correct.

Applications - Dead Code Insertion

• Dead code or junk code is a semantic nop (no operation).

• Inserted into malware to evade signature detection of code.

• The native assembly and Wire’s three address code are shown below:

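Native assembly, without dead code:

mov $0,%eax

Native assembly, with dead code:

mov $0,%eax
add $50,%eax
sub $50,%eax

Wire’s IL, without dead code:

ASSIGNC $0,,%eax

Wire’s IL, with dead code:

ASSIGNC $0,-,%eax
BOPCADD %eax,$50,%eax
BOPCSUB %eax,$50,%eax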

How the equivalence proofs work

• In the first part of the proofs, the original code is executed following the operational semantics of Wire.

• In the second part of the proofs, the obfuscated code is executed.

• The proofs are constructed by showing that the final states of the two parts are the same when both start from the same initial state.
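The two-part structure can be mimicked with a small concrete interpreter: run both sequences from the same initial state and compare the final states with the program counter left out. This is a sketch under stated assumptions. The tuple encoding, the 32-bit wrap-around and the helper names `execute` and `equivalent` are invented here, the dead-code sequence is one plausible instance of the obfuscation, and a test on concrete states is only a sanity check, whereas the proofs reason about arbitrary states.

```python
REG = {"eax": 0, "ebx": 1, "zf": 100}   # register numbering used in the proofs

def execute(program, eax=0, ebx=0):
    """Run straight-line Wire-like code and return the final state."""
    state = {"pc": 0, REG["eax"]: eax, REG["ebx"]: ebx, REG["zf"]: 0}
    for opcode, a, b, dst in program:
        if opcode == "ASSIGNC":          # dst := constant a
            state[REG[dst]] = a
        elif opcode == "BOPCADD":        # dst := a + constant b (32-bit, assumed)
            state[REG[dst]] = (state[REG[a]] + b) & 0xFFFFFFFF
        elif opcode == "BOPCSUB":        # dst := a - constant b (32-bit, assumed)
            state[REG[dst]] = (state[REG[a]] - b) & 0xFFFFFFFF
        else:
            raise ValueError(f"unsupported opcode: {opcode}")
        state["pc"] += 1
    return state

def equivalent(p, q, eax=0, ebx=0):
    """Compare final states of p and q, ignoring the program counter."""
    drop_pc = lambda s: {k: v for k, v in s.items() if k != "pc"}
    return drop_pc(execute(p, eax, ebx)) == drop_pc(execute(q, eax, ebx))

ORIGINAL = [("ASSIGNC", 0, None, "eax")]
OBFUSCATED = [
    ("ASSIGNC", 0, None, "eax"),
    ("BOPCADD", "eax", 50, "eax"),       # dead code: +50 ...
    ("BOPCSUB", "eax", 50, "eax"),       # ... then -50
]

print(equivalent(ORIGINAL, OBFUSCATED, eax=1234, ebx=5))
```

The program counters differ (1 versus 3 instructions executed), which is why the comparison excludes `pc`.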

Dead Code Insertion Proof

Reg_name(“eax”) = 0
Reg_name(“ebx”) = 1
Reg_name(“zf”) = 100

In the first part of the dead code equivalence proof we execute the instructions without the dead code.

In the second part of the proof we execute the instructions with the dead code.

Now we can see that t’’’-pc = s’-pc which means they are semantically equivalent when ignoring the effect the code has on the program counter. We also note that s’ and s’’ are semantically equivalent. We have thus proven that the obfuscated and deobfuscated code samples are equivalent.

Applications – Code Reordering

• Code reordering changes the order of instructions while maintaining semantic equivalence.

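Native assembly, first sequence:

mov $2,%eax
mov $1,%ebx
add %eax,%ebx

Native assembly, second sequence:

mov $1,%ebx
mov $2,%eax
add %eax,%ebx

Wire’s IL, first sequence:

ASSIGNC $0x2,,%eax
ASSIGNC $1,,%ebx
BOPADD %ebx,%eax,%ebx

Wire’s IL, second sequence:

ASSIGNC $0x1,-,%ebx
ASSIGNC $2,-,%eax
BOPADD %ebx,%eax,%ebx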

Code Reordering Proof

For the first part of the proof we execute the first instruction sequence.

For the second part of the proof we execute the second instruction sequence.

Thus we see that t’’’-pc = s’’’-pc and therefore the two instruction sequences are semantically equivalent.
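The register updates behind this conclusion can be written out as a short worked derivation. The notation is an assumption made for this sketch: $s[r \mapsto v]$ denotes state $s$ with register $r$ set to $v$, both sequences start from the same state $s$, and the program counter update is omitted because it is excluded from the comparison.

```latex
% Left column: first sequence. Right column: second sequence.
\begin{align*}
s'   &= s[\mathit{eax} \mapsto 2]        & t'   &= s[\mathit{ebx} \mapsto 1] \\
s''  &= s'[\mathit{ebx} \mapsto 1]       & t''  &= t'[\mathit{eax} \mapsto 2] \\
s''' &= s''[\mathit{ebx} \mapsto 1 + 2]  & t''' &= t''[\mathit{ebx} \mapsto 1 + 2]
\end{align*}
```

Both final states map eax to 2 and ebx to 3 and agree with $s$ everywhere else. The two assignments commute because they write different registers and neither reads the other's result.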

Applications – Opaque Predicate Insertion

• An opaque predicate is a predicate that always evaluates to the same value, but this value is hard to determine statically.

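Native assembly, without the opaque predicate:

xor %eax,%eax
mov $2,%eax

Native assembly, with the opaque predicate:

xor %eax,%eax
jnz $0x80482000
mov $2,%eax

Wire’s IL, without the opaque predicate:

BOPXOR %eax,%eax,%eax
UMKBOOLIsZero %eax,,%zf
ASSIGNC $2,-,%eax

Wire’s IL, with the opaque predicate:

BOPXOR %eax,%eax,%eax
UMKBOOLIsZero %eax,,%zf
UCJMPIsNotZero %zf,,$target
ASSIGNC $2,-,%eax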

Opaque Predicate Insertion Proof

In the first part of the proof we execute the first code sequence.

In the second part of the proof we execute the second code sequence.

We see that register 100 (zf) is set, so the condition of the conditional branch in the following instruction is false and the branch is not taken.

Thus we see that s’’-pc=t’’’’-pc and this proves semantic equivalence.
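The effect of the opaque predicate can be checked on concrete states with a small interpreter that also handles the conditional branch. The encoding and the function `run` are assumptions for this sketch. In particular, the branch is modelled as taken only when the zf register is clear, which is how the statement above about register 100 reads; the paper's rule for UCJMPIsNotZero is the authoritative definition.

```python
REG = {"eax": 0, "zf": 100}   # register numbering used in the proofs

def run(program, eax):
    """Execute Wire-like code; a branch target past the end halts execution."""
    state = {"pc": 0, REG["eax"]: eax, REG["zf"]: 0}
    while state["pc"] < len(program):
        opcode, a, dst = program[state["pc"]]
        state["pc"] += 1
        if opcode == "BOPXOR":             # dst := dst ^ a
            state[REG[dst]] ^= state[REG[a]]
        elif opcode == "UMKBOOLIsZero":    # dst := 1 if a == 0 else 0
            state[REG[dst]] = int(state[REG[a]] == 0)
        elif opcode == "UCJMPIsNotZero":   # taken only when the flag is clear (assumption)
            if state[REG[a]] == 0:
                state["pc"] = dst          # dst is an instruction index here
        elif opcode == "ASSIGNC":          # dst := constant a
            state[REG[dst]] = a
        else:
            raise ValueError(f"unsupported opcode: {opcode}")
    return state

TARGET = 99   # hypothetical branch target outside the sequence
PLAIN = [("BOPXOR", "eax", "eax"), ("UMKBOOLIsZero", "eax", "zf"),
         ("ASSIGNC", 2, "eax")]
OPAQUE = [("BOPXOR", "eax", "eax"), ("UMKBOOLIsZero", "eax", "zf"),
          ("UCJMPIsNotZero", "zf", TARGET), ("ASSIGNC", 2, "eax")]

def drop_pc(state):
    return {k: v for k, v in state.items() if k != "pc"}

same = all(drop_pc(run(PLAIN, v)) == drop_pc(run(OPAQUE, v))
           for v in (0, 1, 0xDEADBEEF))
```

Because `xor %eax,%eax` always yields zero, the flag is always set and the branch never fires, whatever value eax held on entry.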

Conclusion

• Wire is a new formal intermediate language.

• Formally defined semantics allow for formal reasoning.

• Wire has demonstrated applications in binary analysis.
