Wire - A Formal Intermediate Language for Binary Analysis
Silvio Cesare and Yang Xiang
School of Information Technology
Deakin University
Introduction - Motivation
• Static analysis has many benefits
• Applications include:
– Bug detection
– Plagiarism detection
– Code optimisation
• Mostly source-level, but binary-level analysis offers additional benefits and applications:
– Malware detection
– Software theft detection
– Bug detection of compiled and link-edited programs
Introduction - Challenges
• Binary analysis is hard.
– Even separating code from data is undecidable.
– Perfect disassembly of x86 is undecidable.
• Many challenges.
– Native CISC architectures have hundreds of complex instructions.
– Native instructions have side effects which require hidden assumptions in analysis.
– Native architectures require separate implementations on each platform.
Innovation in our work
• Wire - a new formal intermediate language (IL).
• Translation of native assembly to our IL.
• Applications - semantic equivalence proofs of obfuscated assembly.
• Applications - Malwise, a malware classification system from our previous work, uses Wire as the IL.
Related Work
• A compiler’s intermediate representation
– Three Address Code
• Dynamic Binary Instrumentation
– QEMU
– VEX (used in Valgrind)
• A decompiler’s intermediate representation
– DCC
– Boomerang
– IDA Pro and Hex-Rays
• Binary analysis
– Vine (based on VEX), BIL (BitBlaze)
– REIL
Three address code example:
    i := 0
L1:
    if i >= 10 goto L2
    t0 := i*i
    t1 := &b
    t2 := t1 + i
    *t2 := t0
    i := i + 1
    goto L1
L2:
Translating Native Code (1)
• Load object file format
– x86 ELF32, PE32
– Some Java class file support
• Disassemble
– Linear Sweep
– Recursive Traversal
– Speculative
• Translate each native instruction to n three address codes.
$$\mathit{native} \rightarrow \{\, (n, (\mathit{Opcode}, \mathit{Operand}_1, \mathit{Operand}_2, \mathit{Operand}_3)) \mid n \in \mathbb{N} \,\}$$
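The mapping above can be sketched as a function from one native instruction to a list of three-address tuples. This is a minimal Python sketch and not the Wire translator: the function name `translate`, the tuple layout, and the choice of which instructions emit a condition-code update are assumptions. The mnemonics (ASSIGNC, BOPADD, BOPXOR, UMKBOOLIsZero) and operand orders follow the examples in the Applications slides.

```python
from typing import List, Tuple

# (opcode, operand1, operand2, operand3); "" marks an unused operand slot.
Tac = Tuple[str, str, str, str]

def translate(native: str) -> List[Tac]:
    """Translate one AT&T-syntax instruction into n >= 1 three-address codes.

    Hypothetical helper: only the three instruction shapes used in the
    slides' examples are handled.
    """
    mnemonic, _, rest = native.strip().partition(" ")
    src, dst = [op.strip() for op in rest.split(",")]
    if mnemonic == "mov" and src.startswith("$"):
        # mov $imm,%reg  ->  one constant assignment
        return [("ASSIGNC", src, "", dst)]
    if mnemonic == "add" and src.startswith("%"):
        # add %src,%dst  ->  dst := dst + src
        return [("BOPADD", dst, src, dst)]
    if mnemonic == "xor":
        # xor %src,%dst  ->  the operation plus an explicit zero-flag update
        return [("BOPXOR", dst, src, dst),
                ("UMKBOOLIsZero", dst, "", "%zf")]
    raise NotImplementedError(native)

print(translate("xor %eax,%eax"))
```

Making the flag update a separate tuple is the point of the exercise: the side effect that is implicit in the native `xor` becomes an ordinary instruction that later analysis can see.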
Translating Native Code (2)
• Map registers between IL’s abstract machine and native architecture.
• Assign labels to beginning of basic blocks.
• Assign results of arithmetic (etc.) instructions to condition code variables:
– E.g. eq_cond = mkbool x == y
• Decompile parts of IL for additional information.
• Optimise IL code.
Formal Syntax
Instructions  I ::= n → i
Heap          H ::= n × n → n
Memory        M ::= n → n
Register      R ::= r → n
Labels        L ::= l → pc
AllocAMemory  V ::= n × n → n

Instructions: maps instruction number to instruction
Heap: maps heap address and memory size to non-overlapping memory addresses
Register: maps register name to numeric value
Memory: maps address to numeric value
Labels: maps label to instruction address pc
AllocAMemory: maps alloca address and memory size to non-overlapping memory addresses
Program      p ::= p i | i
Instruction  i ::= m | m t
Type         t ::= u8_t | u16_t | u32_t | s8_t | s16_t | s32_t
Instructions m ::= *(r3) := r1
                 | r3 := (*r1)
                 | r3 := r1
                 | r3 := n
                 | r3 := uop r1
                 | r3 := r1 bop r2
                 | r3 := r1 bop n
                 | mkbool r1 ucond
                 | mkbool r1 bcond r2
                 | nop
                 | halt
                 | label l
                 | jmp l
                 | ijmp r
                 | if r1 cond1 jmp l
                 | if r1 cond2 r2 jmp l
                 | lcall s
                 | cast(r1, t)
                 | r3 := getpc()
                 | r3 := returnaddress()
                 | pusharg(n, r)
                 | r3 := malloc(r)
                 | free(r)
                 | r3 := alloca(r)
Operations   uop   ::= - | ~ | !
             bop   ::= + | - | * | / | % | >> | << | | | & | ^
Conditions   ucond ::= == 0 | != 0
             bcond ::= == | != | > | >= | < | <=
Operands     v     ::= n (an integer literal)
                     | r (a register)
                     | l (a label)
                     | s (a symbol)
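To make the type annotations in the grammar concrete, the following Python sketch shows one way a value could be reduced to a Wire type, as `cast(r1, t)` might do. The type names come from the grammar; the two's-complement wraparound behaviour and the helper name `cast` as a Python function are assumptions, since these slides do not give the casting semantics.

```python
# Bit width and signedness for each Wire type name (names from the grammar).
TYPES = {
    "u8_t": (8, False), "u16_t": (16, False), "u32_t": (32, False),
    "s8_t": (8, True),  "s16_t": (16, True),  "s32_t": (32, True),
}

def cast(value: int, type_name: str) -> int:
    """Reduce value to the given type, ASSUMING two's-complement wraparound."""
    bits, signed = TYPES[type_name]
    value &= (1 << bits) - 1              # keep only the low `bits` bits
    if signed and value >= 1 << (bits - 1):
        value -= 1 << bits                # reinterpret the top bit as the sign
    return value

print(cast(300, "u8_t"), cast(255, "s8_t"))
```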
Formal Semantics
• Operational semantics define the state transitions that occur from execution of the program.
$$\frac{\mathit{premise}_1 \quad \ldots \quad \mathit{premise}_n}{(i, P) \Rightarrow P'}\ \text{NAME}$$
where i is the current instruction, P is the program state and P’ is the new program state.
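As an illustration of this rule format, a rule for the constant assignment r3 := n might read as follows. This is a sketch and not a rule taken from the paper: the state layout P = (R, M, pc), the update notation R[r3 ↦ n], and the rule name are all assumptions made for the example.

```latex
\[
\frac{i = (r_3 := n) \qquad P = (R, M, pc)}
     {(i, P) \Rightarrow (R[r_3 \mapsto n],\; M,\; pc + 1)}
\ \text{ASSIGN-CONST}
\]
```

Read from top to bottom: if the current instruction is a constant assignment and the state consists of registers R, memory M and program counter pc, then the new state has r3 bound to n, memory unchanged, and the program counter advanced by one.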
Formal Semantics of Wire
• Control Flow Instructions
• Arithmetic Instructions
• Boolean Instructions
• Memory Access Instructions
• Casting Instructions
• Decompiled Instructions
– Address Instructions
– Memory Allocation Instructions
– Procedural Instructions
Formal Semantics Examples
• See paper for full instruction semantics
The LOAD instruction implements a memory read.
The STORE instruction implements a memory write.
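The effect of the two memory access instructions can be sketched in Python. This is an informal illustration rather than the rules from the paper: memory is modelled as a dictionary from address to value (following the "Memory: maps address to numeric value" definition), and operand sizes, types, and the program counter update are deliberately ignored. The function names `load` and `store` are hypothetical.

```python
from typing import Dict

def load(regs: Dict[str, int], mem: Dict[int, int], r3: str, r1: str) -> None:
    """r3 := (*r1): read the value stored at the address held in r1."""
    regs[r3] = mem[regs[r1]]

def store(regs: Dict[str, int], mem: Dict[int, int], r3: str, r1: str) -> None:
    """*(r3) := r1: write the value of r1 to the address held in r3."""
    mem[regs[r3]] = regs[r1]

regs = {"r1": 0x1000, "r2": 7, "r3": 0}
mem: Dict[int, int] = {}
store(regs, mem, "r1", "r2")   # *(r1) := r2
load(regs, mem, "r3", "r1")    # r3 := (*r1)
print(regs["r3"])
```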
Formal Semantics – Three Address Code
Applications
• A formal language leads to formal proofs.
• Equivalence proofs enable detection of obfuscated code in malware.
• We assume the translation from the native assembly architecture to the IL is correct.
Applications - Dead Code Insertion
• Dead code or junk code is a semantic nop (no operation).
• Inserted into malware to evade signature detection of code.
• The native assembly and Wire’s three address code are shown below:
Without dead code
  native assembly:
    mov $0,%eax
  Wire’s IL:
    ASSIGNC $0,,%eax
With dead code
  native assembly:
    mov $0,%eax
    add $50,%eax
    sub $50,%eax
  Wire’s IL:
    ASSIGNC $0,-,%eax
    BOPCADD %eax,$50,%eax
    BOPCSUB %eax,$50,%eax
How the equivalence proofs work
• The original code is executed following the operational semantics of Wire.
• In the second part of the proofs, the obfuscated code is executed.
• The proofs are constructed by showing the final states of the two previous parts are the same given the initial states.
Dead Code Insertion Proof
Reg_name(“eax”) = 0
Reg_name(“ebx”) = 1
Reg_name(“zf”) = 100
In the first part of the dead code equivalence proof we execute the instructions without the dead code.
In the second part of the proof we execute the instructions with the dead code.
Now we can see that t’’’-pc = s’-pc, which means they are semantically equivalent when ignoring the effect the code has on the program counter. We also note that s’ and s’’ are semantically equivalent. We have thus proven that the obfuscated and deobfuscated code samples are equivalent.
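The shape of this argument (run both sequences from the same initial state, then compare the final states with the program counter removed) can be mimicked with a small interpreter. The sketch below is a concrete test on one initial state, not a proof, and it is not the Wire semantics: the `run` function, the 32-bit wraparound, the tuple layout, and the omission of condition codes are assumptions. It also reads the operand shown as `%50` in the BOPCSUB listing as the constant `$50`.

```python
from typing import Dict, List, Tuple

Tac = Tuple[str, str, str, str]   # (opcode, operand1, operand2, operand3)
MASK = 0xFFFFFFFF                 # ASSUMED 32-bit registers

def const(operand: str) -> int:
    """Parse a constant operand such as '$50' or '$0x2'."""
    return int(operand.lstrip("$"), 0)

def run(program: List[Tac], regs: Dict[str, int]) -> Dict[str, int]:
    """Execute straight-line code; return the final registers plus pc."""
    state = dict(regs, pc=0)
    for op, a, b, dst in program:
        if op == "ASSIGNC":                  # dst := constant a
            state[dst] = const(a) & MASK
        elif op == "BOPCADD":                # dst := a + constant b
            state[dst] = (state[a] + const(b)) & MASK
        elif op == "BOPCSUB":                # dst := a - constant b
            state[dst] = (state[a] - const(b)) & MASK
        else:
            raise NotImplementedError(op)
        state["pc"] += 1
    return state

def without_pc(state: Dict[str, int]) -> Dict[str, int]:
    """Project the program counter out of a state."""
    return {k: v for k, v in state.items() if k != "pc"}

original = [("ASSIGNC", "$0", "", "%eax")]
obfuscated = [("ASSIGNC", "$0", "", "%eax"),
              ("BOPCADD", "%eax", "$50", "%eax"),
              ("BOPCSUB", "%eax", "$50", "%eax")]

initial = {"%eax": 1234}
s = run(original, initial)      # one step, like s'
t = run(obfuscated, initial)    # three steps, like t'''
print(without_pc(s) == without_pc(t), s["pc"], t["pc"])
```

The two final states differ only in `pc`, which is why the comparison is made after removing it.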
Applications – Code Reordering
• Code reordering changes the order of instructions while maintaining semantic equivalence.
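First instruction sequence
  native assembly:
    mov $2,%eax
    mov $1,%ebx
    add %eax,%ebx
  Wire’s IL:
    ASSIGNC $0x2,,%eax
    ASSIGNC $1,,%ebx
    BOPADD %ebx,%eax,%ebx
Second instruction sequence
  native assembly:
    mov $1,%ebx
    mov $2,%eax
    add %eax,%ebx
  Wire’s IL:
    ASSIGNC $0x1,-,%ebx
    ASSIGNC $2,-,%eax
    BOPADD %ebx,%eax,%ebx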
Code Reordering Proof
For the first part of the proof we execute the first instruction sequence.
For the second part of the proof we execute the second instruction sequence.
Thus we see that t’’’-pc = s’’’-pc and therefore the two instruction sequences are semantically equivalent.
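A complementary way to see why the two assignments may be swapped is a data-dependency check: two adjacent instructions commute when neither writes a register the other reads or writes. Note that this is a different technique from the state-based proof on the slides, offered only as an illustration. The helper names are hypothetical, and the sketch ignores memory and condition codes.

```python
from typing import Set, Tuple

Tac = Tuple[str, str, str, str]   # (opcode, operand1, operand2, operand3)

def reads(ins: Tac) -> Set[str]:
    """Registers an instruction reads: source operands starting with '%'."""
    return {x for x in ins[1:3] if x.startswith("%")}

def writes(ins: Tac) -> Set[str]:
    """Registers an instruction writes: its destination operand."""
    return {ins[3]}

def can_swap(a: Tac, b: Tac) -> bool:
    """True if neither instruction writes anything the other reads or writes."""
    conflict = (writes(a) & (reads(b) | writes(b))) | (writes(b) & reads(a))
    return not conflict

set_eax = ("ASSIGNC", "$2", "", "%eax")
set_ebx = ("ASSIGNC", "$1", "", "%ebx")
add = ("BOPADD", "%ebx", "%eax", "%ebx")

print(can_swap(set_eax, set_ebx))  # the two assignments are independent
print(can_swap(set_ebx, add))      # the add depends on %ebx
```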
Applications – Opaque Predicate Insertion
• An opaque predicate is a predicate that always evaluates to the same value, but this value is hard to determine statically.
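First code sequence
  native assembly:
    xor %eax,%eax
    mov $2,%eax
  Wire’s IL:
    BOPXOR %eax,%eax,%eax
    UMKBOOLIsZero %eax,,%zf
    ASSIGNC $2,-,%eax
Second code sequence (with opaque predicate)
  native assembly:
    xor %eax,%eax
    jnz $0x80482000
    mov $2,%eax
  Wire’s IL:
    BOPXOR %eax,%eax,%eax
    UMKBOOLIsZero %eax,,%zf
    UCJMPIsNotZero %zf,,$target
    ASSIGNC $2,-,%eax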
Opaque Predicate Insertion Proof
In the first part of the proof we execute the first code sequence.
In the second part of the proof we execute the second code sequence.
We see that register 100 (zf) is set, which makes the conditional branch in the following instruction use a false condition.
Thus we see that s’’-pc = t’’’’-pc, and this proves semantic equivalence.
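Why the inserted branch can never be taken is easiest to see at the native level: `xor %eax,%eax` always produces zero, so the zero flag is always set, and x86 `jnz` branches only when the zero flag is clear. The Python sketch below checks this on a few sample values. It models the x86 flag behaviour directly rather than the Wire encoding, the function names are hypothetical, and sampling is only a sanity check (the general claim rests on x ^ x = 0 for every x).

```python
def zf_after_xor_self(eax: int) -> int:
    """ZF after `xor %eax,%eax`: the result is always 0, so ZF is always 1."""
    result = (eax ^ eax) & 0xFFFFFFFF
    return 1 if result == 0 else 0

def jnz_taken(zf: int) -> bool:
    """x86 `jnz` branches only when ZF is clear."""
    return zf == 0

samples = [0, 1, 2, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]
branch_ever_taken = any(jnz_taken(zf_after_xor_self(v)) for v in samples)
print(branch_ever_taken)
```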
Conclusion
• Wire is a new formal intermediate language.
• Formally defined semantics allow for formal reasoning.
• Wire has demonstrated applications in binary analysis.