Unconventional Code Constructs
© 2006 Nathan Rosenblum
March 2006
The New Dyninst Code Parser: Binary Code Isn't as
Simple as it Used to Be
Nathan Rosenblum, University of Wisconsin
Binary Analysis

Processing of the binary code to extract syntactic and symbolic information from many sources:
• Symbol tables (if present)
• Decode (disassemble) instructions
• Control-flow information: basic blocks, loops, functions
• Data-flow information: from basic register information to highly sophisticated (and expensive) analyses
Products of Binary Analysis

High-level organization and characteristics
• Function entry/exit points
• Inter-procedural call graph
• Intra-procedural control-flow graph
• Exception handlers
• Jump tables
• Virtual function tables
Abstract assembly representation
Data-flow characteristics
• Register liveness (for instrumentation, modification)
Uses of Binary Analysis

• Debugging
• Testing
• Performance profiling
• Performance modeling
• Behavior modeling
• Dynamic modification
• Binary rewriting
• Reverse engineering
Binary Analysis Tool Goals

Safe: Eliminate false positives to make instrumentation safe
Accurate: Minimize false negatives for complete view of the binary
Opportunistic: Use all available information and techniques to maximum effect
Resilient: Tools are robust to unexpected and unusual applications
Automated: Analysis does not depend on human interaction
Complementary: Produce products compatible with source-level analysis tools
Why is Binary Analysis Hard?

Source Code:
Func foo()
{
…
switch(a) {
…
}
…
}

Binary (output of the compiler):
push %ebp
mov %esp, %ebp
…
mov [0x1d], %eax
jmp *%eax
…
Current Approaches

Linear disassembly of binaries is insufficient
• Symbol tables often lie, or are absent
• Functions are not address ranges, may be non-contiguous
Parsing based on program control flow
• Commonly used approach: UQBT, LEEL, RAD, IDA-Pro, Dyninst
• Must contend with gaps in known code regions after parsing
Dyninst Control Flow Parsing

Opportunistic parsing:
• Utilizes symbol table and other information when available (and sensible)
Provides more accurate view of the binary than linear disassembly
Addresses problem of gaps in the binary through speculative parsing
• Heuristics to identify function preambles
Control Flow Traversal Illustrated
<func foo>:
00: mov [a8], r1
04: mov [ac], r2
08: add r1, r2, r3
0c: cmp r3, 0
10: bne 24
14: call <bar>
18: add r3, 8, r3
1c: call <baz>
20: jmp 28
24: mul r2, 2, r3
28: sub r1, r3, r1
. . .
[Figure: control-flow graph with basic blocks starting at 00, 14, 24, and 28]
• Parsing follows control flow
• Control transfers are edges in the CFG
• Target blocks can be parsed in any order
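The traversal above can be sketched on the toy listing. This is only an illustration under stated assumptions: the dictionary encoding, the fixed 4-byte instruction size, and the helper names find_leaders and build_edges are hypothetical and are not Dyninst's API. Calls are treated as falling through, as in the figure.

```python
# Toy encoding of <func foo>: address -> (mnemonic, branch target or None).
# The encoding and function names are illustrative assumptions.
PROGRAM = {
    0x00: ("mov", None), 0x04: ("mov", None), 0x08: ("add", None),
    0x0c: ("cmp", None), 0x10: ("bne", 0x24), 0x14: ("call", None),
    0x18: ("add", None), 0x1c: ("call", None), 0x20: ("jmp", 0x28),
    0x24: ("mul", None), 0x28: ("sub", None),
}
INSN_SIZE = 4  # fixed-width instructions, as in the toy listing


def find_leaders(program, entry):
    """Follow control flow from entry and collect basic-block start addresses."""
    leaders, seen, worklist = {entry}, set(), [entry]
    while worklist:
        addr = worklist.pop()  # targets may be processed in any order
        while addr in program and addr not in seen:
            seen.add(addr)
            op, target = program[addr]
            nxt = addr + INSN_SIZE
            if op in ("jmp", "bne"):
                leaders.add(target)
                worklist.append(target)
                if op == "bne":  # conditional branch also falls through
                    leaders.add(nxt)
                    worklist.append(nxt)
                break
            addr = nxt
    return leaders


def build_edges(program, leaders):
    """Create one CFG edge per control transfer or fall-through between blocks."""
    edges = set()
    for start in leaders:
        addr = start
        while True:
            op, target = program[addr]
            nxt = addr + INSN_SIZE
            if op in ("jmp", "bne"):
                edges.add((start, target))
                if op == "bne":
                    edges.add((start, nxt))
                break
            if nxt in leaders:  # block ends where the next block begins
                edges.add((start, nxt))
                break
            if nxt not in program:  # end of the toy listing
                break
            addr = nxt
    return edges


leaders = find_leaders(PROGRAM, 0x00)
edges = build_edges(PROGRAM, leaders)
print(sorted(hex(a) for a in leaders))
```

Splitting leader discovery from edge construction keeps the result independent of the order in which targets are taken off the work list.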
Control Flow Traversal Illustrated
<func foo>:
00: mov [a8], r1
04: mov [ac], r2
08: add r1, r2, r3
0c: cmp r3, 0
10: bne 24
14: call <bar>
18: add r3, 8, r3
1c: call <baz>
20: jmp 28
24: mul r2, 2, r3
28: sub r1, r3, r1
. . .
• Call sites determine location of functions
• Targets of calls are added to the function parsing work list
Known Functions: foo, quux, quuux, bar, baz
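A minimal sketch of the function work list follows. It assumes that foo, quux, and quuux are already known (for example from a symbol table; the slide does not say where they come from) and that bar and baz are found through the call sites in foo. The CALLS table and the discover_functions helper are hypothetical.

```python
from collections import deque

# Hypothetical call sites found while parsing each function body.
CALLS = {"foo": ["bar", "baz"]}


def discover_functions(calls, seeds):
    """Parse each known function once; queue every new call target for parsing."""
    known = list(seeds)
    worklist = deque(seeds)
    while worklist:
        func = worklist.popleft()
        for callee in calls.get(func, []):
            if callee not in known:
                known.append(callee)
                worklist.append(callee)
    return known


print(discover_functions(CALLS, ["foo", "quux", "quuux"]))
```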
Binary Parsing Challenges

• Pointer-based control transfer
• Non-returning calls
• Non-contiguous code sections
• Tail calls
• Gaps in the binary
• Exception handlers
• Shared code and multiple entry representation
Non-returning Call Sites

Some functions will not return
• Examples: abort, exit
Code following call site may not be valid
Even if names are available, calls may be hard to detect:
dfaerror fatal exit
Detecting Non-Returning Functions

Goal: detect non-returning functions from first principles
Identify distinguishing features of non-returning functions
• Wide variety of behavior in non-returning functions makes this difficult
Example: operations in abort
abort() ->
sigaction()
IO_flush_all()
raise(SIGABRT) ->
kill(getpid(),sig)
hlt [privileged instruction]
Non-returning Call Sites
000214d0 <__assert_fail>:
. . .
2160f: e8 cc db 0a 00 call cf1e0 <__libc_write>
21614: e8 07 7f 00 00 call 29520 <abort>
21619: 90 nop
2161a: 90 nop
2161b: 90 nop
2161c: 90 nop
2161d: 90 nop
2161e: 90 nop
2161f: 90 nop
00021620 <__assert_perror_fail>:
21620: 55 push %ebp
21621: 89 e5 mov %esp,%ebp
. . .
Example: GNU libc library routines
•Call to abort does not return
•Parser will naively follow control into the following region
•Bytes following call site may not be code (e.g., jump tables, other functions, string data)
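One way a parser might avoid running into the padding is to stop fall-through parsing after a call to a function already known not to return. The sketch below assumes such a set of names is available, which is the easy case; the NON_RETURNING set, the tuple encoding, and parse_fallthrough are illustrative assumptions.

```python
# Names assumed to be known non-returning functions (an illustrative list).
NON_RETURNING = {"abort", "exit"}

# (address, size, mnemonic, call target) for the tail of __assert_fail above.
LISTING = [
    (0x2160f, 5, "call", "__libc_write"),
    (0x21614, 5, "call", "abort"),
    (0x21619, 1, "nop", None),
    (0x2161a, 1, "nop", None),
]


def parse_fallthrough(listing):
    """Walk instructions in order, stopping after a call that cannot return."""
    parsed = []
    for addr, _size, op, target in listing:
        parsed.append(addr)
        if op == "call" and target in NON_RETURNING:
            break  # bytes after this call are not treated as code
    return parsed


print([hex(a) for a in parse_fallthrough(LISTING)])
```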
Non-contiguous Code

[Figure: Func Foo]
• Functions are not address ranges
• Symbol table representation fails
• Many sources of non-contiguous layout:
  • Jump tables
  • Data (strings, etc)
  • Unparsed code
  • Exception handlers
  • Padding or junk bytes
Non-contiguous Code
. . .
77e7b1cb: 83 41 04 04 addl $0x4,0x4(%ecx)
77e7b1cf: 5d pop %ebp
77e7b1d0: c2 0c 00 ret $0xc
77e7b1d3: 68 f5 06 00 00 push $0x6f5
77e7b1d8: eb 05 jmp 0x77e7b1df
77e7b1da: 68 e6 06 00 00 push $0x6e6
77e7b1df: e8 bb 86 02 00 call 0x77ea389f
77e7b1e4: 4c ba e7 77
77e7b1e8: 34 b2 e7 77
77e7b1ec: b5 b1 e7 77
77e7b1f0: 0c 9f e8 77
77e7b1f4: 96 37 e8 77
77e7b1f8: cf b1 e7 77
77e7b1fc: 00 00 00 00 01 01 01 02 02 02 03 03 04 02 05
77e7b20c: 3c 10 cmp $0x10,%al
77e7b20e: 0f 85 a6 3b 02 00 jne 0x77e9edba
. . .
Example: Microsoft Word
•Jump table separates valid instruction sequences
•Control following call site is invalid
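To see why the bytes at 77e7b1e4 through 77e7b1fb look like a jump table, they can be decoded as six 32-bit little-endian addresses. Reading them this way is an assumption that fits 32-bit x86 code; the variable names are illustrative.

```python
import struct

# The 24 bytes that follow the call at 77e7b1df in the listing above.
raw = bytes.fromhex("4cbae777 34b2e777 b5b1e777 0c9fe877 9637e877 cfb1e777")

# Interpret the bytes as six unsigned 32-bit little-endian values.
targets = struct.unpack("<6I", raw)
for t in targets:
    print(hex(t))
```

The last entry decodes to 0x77e7b1cf, the address of the pop %ebp instruction in the same listing, which supports reading these bytes as branch targets rather than instructions.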
Named Non-contiguous Sections
00021060 <__duplocale>:
....
210f0: lock cmpxchg %ecx,0x2968(%ebx)
210f8: jne 2118e
210fe: xor %esi,%esi
21100: cmp $0x6,%esi
...
0002118e <_L_mutex_lock_78>:
2118e: lea 0x2968(%ebx),%ecx
21194: call ea0f0
21199: jmp 210fe
Example: GNU libc library routines
•Looks like shared code
•Fragment is not a real function
Named Non-contiguous Sections

Recognizing function fragments
• Have a symbol table entry
• Reached by branches from one function
• Branch back to one function
Use combination of CFG and symbol table clues
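The three clues can be combined into a simple predicate. This is a sketch only: is_fragment is a hypothetical helper, and requiring the branch-back target to be the same function that branches in is an assumption, not something the slide states.

```python
def is_fragment(has_symbol, branch_sources, branch_targets):
    """Return True when a named region looks like a fragment of one function.

    branch_sources: set of functions that branch into the region.
    branch_targets: set of functions the region branches back to.
    """
    return (
        has_symbol
        and len(branch_sources) == 1
        and branch_sources == branch_targets  # assumed: same single function
    )


# _L_mutex_lock_78 is entered from and returns to __duplocale only.
print(is_fragment(True, {"__duplocale"}, {"__duplocale"}))
```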
Tail Calls

Func Foo
. . .
call <bar>

Func Bar
. . .
jmp <quux>

Func Quux
. . .
ret

• Compiler has joined two functions into one
• Looks like non-contiguous shared code
Gap Parsing

[Figure: Func Foo, unidentified section of code, Func Bar]
• Gaps between known code regions may contain undiscovered functions
• Targets of indirect calls
Speculative parsing: pattern-based heuristics to recognize function prologues in gaps
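A pattern-based gap scan might look like the sketch below. It assumes a single prologue pattern, the bytes 55 89 e5 (push %ebp; mov %esp,%ebp) seen in the earlier __assert_perror_fail listing; a real heuristic would likely use more patterns and then validate each hit by parsing from it. The find_candidates helper and its parameters are hypothetical.

```python
# push %ebp; mov %esp,%ebp as raw bytes (one common x86 prologue).
PROLOGUE = bytes.fromhex("5589e5")


def find_candidates(image, base, gaps):
    """Return addresses inside the given gaps where the prologue bytes occur.

    image: raw bytes of the code section, loaded at address base.
    gaps:  list of (start, end) address ranges not covered by parsed code.
    """
    hits = []
    for start, end in gaps:
        pos = image.find(PROLOGUE, start - base, end - base)
        while pos != -1:
            hits.append(base + pos)
            pos = image.find(PROLOGUE, pos + 1, end - base)
    return hits


# Seven bytes of nop padding followed by a prologue, as in the libc example.
image = b"\x90" * 7 + PROLOGUE + b"\x90" * 4
print([hex(a) for a in find_candidates(image, 0x21619, [(0x21619, 0x21619 + len(image))])])
```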
Exceptions

Exception handling code is normally unreachable
Use information in the binary where available
• Example: Linux ELF exception tables

C++ style exception, with catch block:
push %ebp
mov %esp,%ebp
push %ebx
sub $0x24,%esp
movl $0x6,0xfffffff8(%ebp)
mov 0x8(%ebp),%eax
mov %eax,(%esp)
call 804aafa
jmp 804abe9
mov %eax,0xfffffff4(%ebp)
cmp $0x2,%edx
je 804ab58
. . .
mov 0xfffffff4(%ebp),%eax
mov %eax,(%esp)
call 804a388
add $0x24,%esp
pop %ebx
pop %ebp
ret
Shared Code Models

[Figure: Func A and Func B, with a region of Shared Code]
Code may be shared between functions
• Multiple entry functions
• Compiler optimizations
Analysis tools must be able to recognize and handle overlapping control flow
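One simple way to recognize overlapping control flow is to traverse the CFG from each function entry and flag any block reached from more than one entry. This is a sketch under assumptions, not necessarily how Dyninst does it: the block names, the adjacency-list CFG, and shared_blocks are all hypothetical.

```python
def shared_blocks(cfg, entries):
    """Return the set of blocks reachable from more than one function entry.

    cfg:     block -> list of successor blocks (intra-procedural edges only).
    entries: function name -> entry block.
    """
    owners = {}
    for name, entry in entries.items():
        stack, seen = [entry], set()
        while stack:
            block = stack.pop()
            if block in seen:
                continue
            seen.add(block)
            owners.setdefault(block, set()).add(name)
            stack.extend(cfg.get(block, []))
    return {block for block, funcs in owners.items() if len(funcs) > 1}


# Func A and Func B each have a private entry block and then merge.
cfg = {"A0": ["S"], "B0": ["S"], "S": ["S1"], "S1": []}
print(sorted(shared_blocks(cfg, {"A": "A0", "B": "B0"})))
```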
Summary of Binary Analysis Techniques

Control flow traversal is a powerful tool for addressing the challenges of modern binaries
• Lying/missing symbol tables
• Data/code disambiguation
• Jump tables
Speculative parsing techniques can be useful for expanding code coverage
• Gaps in code
• Indirect calls and branches
Incidence of Shared Code in Binaries

[Figure: number of binaries (y-axis, 0 to 600) by functions containing shared code (x-axis ticks: 0, 4, 16, 64, 256, 1024)]
Parsed 828 Linux/x86 binaries
• 238 contained shared code
Most binaries contain only a few code-sharing functions
Some code sharing may be due to non-returning call sites
Where Do We Go From Here?

Are there good solutions from first principles?
• Almost certainly.
• We are just starting to explore the limits of such techniques.
Are special case solutions necessary?
• Again, almost certainly.
• We will try to use these as sparingly as possible.
Future Directions in Binary Analysis

Problem: code exists but is unreachable through standard control-flow traversal parsing
• Heuristics are a moving target
Existing opportunistic parsing techniques can help, but only to an extent
• Exception handlers, virtual function tables may be recoverable from the binary
Given the information we can recover from traditional techniques, can we synthesize additional information that will increase coverage of the binary?
Statistical Binary Parsing

Can we utilize known code to find unknown code?
• We have a partial parse of the binary
• Code in unknown regions of the binary will likely share characteristics with previously identified code
Identify code in unknown regions:
• Create a probabilistic model of valid code
• Identify sections of unknown regions in the binary that are similar to valid code
Binary Modeling Techniques

Code idioms are one possibility for validating potential code
• Function preambles, jump table bounds tests, system call stubs, case statements
Idioms can be identified manually
Model can be trained to identify new idioms with machine learning techniques
• n-gram models, long-distance interaction
Unparsed code can be scored to indicate its statistical similarity to known code
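As a rough illustration of the scoring idea, the sketch below trains a bigram (2-gram) model over instruction mnemonics from already-parsed code and scores a candidate sequence by its average log-probability. Everything here is an assumption made for illustration: the choice of mnemonics as tokens, add-one smoothing, the tiny training set, and the names train_bigrams and score.

```python
import math
from collections import Counter


def train_bigrams(sequences):
    """Count adjacent mnemonic pairs in known code."""
    pairs, firsts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        vocab.update(seq)
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
            firsts[a] += 1
    return pairs, firsts, vocab


def score(seq, model):
    """Average log-probability of seq (length >= 2) with add-one smoothing."""
    pairs, firsts, vocab = model
    v = len(vocab) + 1  # one extra slot for mnemonics never seen in training
    logps = [
        math.log((pairs[(a, b)] + 1) / (firsts[a] + v))
        for a, b in zip(seq, seq[1:])
    ]
    return sum(logps) / len(logps)


# Hypothetical mnemonic sequences taken from functions that were already parsed.
known = [
    ["push", "mov", "sub", "call", "ret"],
    ["push", "mov", "call", "pop", "ret"],
]
model = train_bigrams(known)
code_like = score(["push", "mov", "call", "ret"], model)
junk_like = score(["ret", "hlt", "hlt", "push"], model)
print(code_like > junk_like)
```

A higher score means the candidate resembles the known code more closely; any threshold for accepting a region as code would have to be chosen empirically.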
Open Questions in Binary Analysis

What learning techniques will yield the best results?
How can we overcome the relative dearth of information in binaries with very little code reachable through control flow analysis?
• Incorporate information from analysis of other binaries
What techniques will allow us to accurately identify the range of recognizable code?
Questions?
Backup Slides
Shared Code Models

[Figure: two models. Shared Code: Func A and Func B. Multiple Entry: Entry A and Entry B]
What is the difference from the perspective of the parser?
A Choice of Abstraction

Shared code and multiple entry models are similar
• Represent independent flows of control merging together
Shared model is a better fit for Dyninst
• Preserves semantic guarantees of function independence
Shared Code
000a94c0 <__waitpid>:
a94c0: cmpl $0x0,%gs:0xc
a94c8: jne a94e7
000a94ca <__waitpid_nocancel>:
a94ca: push %ebx
a94cb: mov 0x10(%esp,1),%edx
a94cf: mov 0xc(%esp,1),%ecx
a94d3: mov 0x8(%esp,1),%ebx
a94d7: mov $0x7,%eax
a94dc: int $0x80
a94de: pop %ebx
a94df: cmp $0xfffff001,%eax
a94e4: jae a9513
. . .
Code common to the two functions is marked as shared.
Example: GNU libc library routines