Binary Analysis and Reverse Engineering
Pattern Recognition and Applications Lab
University of Cagliari, Italy
Department of Electrical and Electronic Engineering
Binary Analysis and Reverse Engineering
Ing. Davide Maiorca, Ph.D.
Computer Security – A.Y. 2017/2018
http://pralab.diee.unica.it
Contents
• Introduction
• ELF Structure
– Introduction to readelf
– From ELF to Memory
• Static Analysis
– Assembly X86 Basics and introduction to objdump
– Memory analysis during function calls
• Dynamic Analysis
– Introduction to gdb
– Dynamic Analysis of Memory
Introduction
About Me
• February 2012
– Master of Science in Electronic Engineering, cum laude
• November 2013 – April 2014
– Visiting Student, Ruhr Universität Bochum (Prof. Dr. Thorsten Holz)
• March 2016
– Ph.D. (Doctor Europaeus), University of Cagliari
– CLUSIT (Italian Association for Computer Security) Thesis Prize winner
• Currently
– Postdoctoral Fellow, University of Cagliari
• Research Topics
– Malware Analysis and Detection in Documents (PDF, Word, Flash…)
– Android Malware Analysis and Detection
– Adversarial Machine Learning
• Other Activities
– Mobile Forensics
– Program Committees for Conferences
Virtual Machine
• Runs Ubuntu
• You can find the files for this lecture directly on the VM
• User: sicurezza1718
• Password: security1718
Introduction – Binary Analysis
• In Computer Security, we are often interested in finding anomalies in an executable (binary) program:
– “Hidden” actions
– Possible attacks or attempts to steal information
– Bugs
• However, we often do not have the source code of the program
• To analyze a binary, we must therefore resort to reverse engineering techniques
– This is often the only way to understand anything about a program!
– A very complex art!
Introduction - Reverse Engineering
• When you program, you usually go from source code to binary files
• Problem: are you really sure that the binary behaves exactly as you wanted?
– You already know it is not that simple...
– We use the term «bug» for a wrong or unexpected behavior of a program
• Reverse Engineering: analyzing the code of an already compiled program to understand its behavior
– This means that you are going to see how a program works in great detail
• Get ready to get your hands dirty!
Challenge!
• Our lectures will be challenge-driven
• The idea is to acquire the concepts that will allow you, in practice, to solve the challenge
• In the virtual machine, you should find a file called sum_number
• Try to run it...
• You do not have the source code, so answering the question is not possible at the moment
• However, thanks to what you learn in these lectures, you will be able to unveil many mysteries…
ELF Structure
Creating an Executable (Linux)
Source (.c) -> Compiler -> Object File (.o) -> Linker -> ELF Executable
ELF
• Executable and Linkable Format
• Executable format for Linux (32 and 64 bit)
– 32 and 64 bit executables are NOT the same
– Memory management and specific instructions are different
• The executable is composed of four parts
• ELF Header
– Basic information about the file (e.g., architecture type, addresses, and section sizes)
• Section Header
– Describes the position of all sections of the executable (compulsory in .o files)
• Program Header
– Describes the executable sections that are loaded in memory during the program execution (segments – compulsory in the executable file)
• Data
– The real file data
ELF (2)
Source (.c) -> Compiler -> Object File (.o) -> Linker -> ELF Executable
The linker changes the addresses of the file sections depending on specific needs (relocation)
ELF Header
• ELF files can be analyzed in practice with different tools
• We start with readelf
– Can be found in any Linux distribution
– Provides information on the file structure
• Let’s start...
– readelf -h sum_number
• Yes, that’s the header of the executable!
• Magic number
– Four bytes that define the file type
• Entry point
– VIRTUAL memory address that identifies the start of the program
– Does the program really begin with ‘main’? Your beliefs are about to be shaken...
• The header also provides information about offsets and sizes of the ELF sections
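To make the header fields more concrete, here is a minimal Python sketch that reads the magic number, the class and the entry point the way readelf -h reports them. It works on a synthetic 52-byte ELF32 header built in the script itself, so no real file is needed. The entry point value and the function name parse_elf32_header are assumptions made for illustration and are not taken from sum_number.

```python
import struct

# Synthetic ELF32 header (52 bytes). All values are illustrative assumptions.
E_IDENT = b"\x7fELF" + bytes([1, 1, 1, 0]) + bytes(8)  # magic, 32 bit, little endian, version 1
HEADER = E_IDENT + struct.pack(
    "<HHIIIIIHHHHHH",
    2,           # e_type: executable file
    3,           # e_machine: Intel 80386
    1,           # e_version
    0x08048350,  # e_entry: hypothetical entry point (virtual address)
    52,          # e_phoff: program header table right after this header
    0,           # e_shoff
    0,           # e_flags
    52,          # e_ehsize
    32,          # e_phentsize
    0,           # e_phnum
    40,          # e_shentsize
    0,           # e_shnum
    0,           # e_shstrndx
)


def parse_elf32_header(data):
    """Return a few ELF32 header fields from the first 52 bytes of a file."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file: wrong magic number")
    fields = struct.unpack_from("<HHIIIIIHHHHHH", data, 16)
    return {
        "class": {1: "ELF32", 2: "ELF64"}[data[4]],
        "endianness": {1: "little", 2: "big"}[data[5]],
        "entry": fields[3],
        "phoff": fields[4],
    }


info = parse_elf32_header(HEADER)
print(info["class"], info["endianness"], hex(info["entry"]), info["phoff"])
```

To try it on a real 32-bit little-endian executable, the first 52 bytes of the file can be passed to the function instead of HEADER.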
Section Header
• Let’s dig deeper into the file…
– readelf -S sum_number
– 35 sections (not all of them are important)!
– We are considering already relocated sections (complete executable)
• .text
– Instructions of the process and read-only data
– Attempts to change these values -> Segmentation Fault!
– Read-only data are generally marked
• .data
– Initialized static data
• .bss
– Uninitialized static data
Section Types and Parameters
• Types
– PROGBITS: sections that contain data that are actually used by the program
– NOTE: extra data that are not useful for the execution of the program
– SYMTAB/DYNSYM: sections that contain information about symbols. Symbols are names that represent data that are used by machine code
– STRTAB: section that contains strings that are used by the executable
– REL: relocation table
• Other parameters
– Address: virtual memory address of the section
– Size: size of the section
– Offset: starting point inside the file
– Flag: execution flags
– You can overlook the other parameters
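The type and flag columns printed by readelf -S are just numbers stored in each section header entry. The sketch below decodes them using the standard ELF constants. The function name describe_section is hypothetical, and NOBITS (the type normally used by .bss) is added here although it is not listed on the slide.

```python
# Numeric section types as defined by the ELF format.
SECTION_TYPES = {
    1: "PROGBITS",
    2: "SYMTAB",
    3: "STRTAB",
    7: "NOTE",
    8: "NOBITS",   # not on the slide: occupies no space in the file (e.g. .bss)
    9: "REL",
    11: "DYNSYM",
}

# Flag bits and the letters readelf uses for them.
SECTION_FLAGS = [(0x1, "W"), (0x2, "A"), (0x4, "X")]  # write, alloc, execute


def describe_section(sh_type, sh_flags):
    """Translate the numeric type and flags of a section into readable names."""
    type_name = SECTION_TYPES.get(sh_type, "OTHER")
    letters = "".join(letter for bit, letter in SECTION_FLAGS if sh_flags & bit)
    return type_name, letters


print(describe_section(1, 0x6))  # typical values for .text
print(describe_section(1, 0x3))  # typical values for .data
print(describe_section(8, 0x3))  # typical values for .bss
```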
Program Header
• To access the program header, just type:
– readelf -l sum_number
• The Program Header is composed of segments
– Each segment is composed of a group of sections
• LOAD type segments are loaded in memory when the program is run
• In our example, segment 02 contains the .text section
– Contains the machine code (flags: Read/Execute) -> This segment is often named .text
• Segment 03 contains the .data and .bss sections
– .data and .bss represent, respectively, initialized and uninitialized data (flags: Read/Write)
– When representing memory, sections .data and .bss are considered as separate segments
• Careful with offsets!
– PHDR is the program header table (in our example, it starts from offset 52)
– LOAD starts from offset 0 (from the file start), but only uses the first 0x6a4 bytes, although 0x1000 are loaded (i.e., 4096, the alignment value due to memory paging)
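The relation between the 0x6a4 bytes actually used and the 0x1000 bytes loaded is a rounding up to the page size. A small sketch of that arithmetic, assuming a 4096-byte page (the helper name align_up is hypothetical):

```python
PAGE_SIZE = 0x1000  # 4096 bytes, the alignment value quoted above


def align_up(size, alignment=PAGE_SIZE):
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)


loaded = align_up(0x6A4)
print(hex(loaded))
```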
From ELF to Memory
Figure: the executable file mapped to memory. Addresses increase towards the bottom. Note how the address of .text is lower than the address of .data.
Linux X86 Process in Memory - Structure
The stack always grows towards lower addresses (the opposite of the rest of the process)!
Stackframe Base
Note that, in the picture on the left, the .text section is in the lower part of the memory, but addresses grow upwards!
Linux X86 Process in Memory - Stack (2)
• Heap
– Dynamically allocated memory
• Stack
– Composed of frames
– Contains information about functions (parameters, return addresses, local variables…)
• Every time a function is called, a frame is allocated in memory
• Function arguments
– Arguments that the function receives
• Return address
– The address to which the function returns at its end
• Frame pointer
– It is considered as the «base» of the frame
• Local variables
– Variables that are defined in the function
Static Analysis
Disassembling an Executable
• Until now, we have inspected the structure of the executable
• Now it’s time to understand what the executable does
• We want to understand which instructions the processor really executes, without executing the file itself (static analysis)
• This is called disassembling
• To this end, we can use the tool objdump
– objdump -d sum_number
• Static Analysis has a lot of advantages:
– It’s usually very fast (especially if automated)
– It immediately provides a lot of information
– Avoids executing the file!
Assembly X86 Basics
• Intel CISC
– Complex Instruction Set Computer
– A lot of instructions! (but we will only use a small subset)
• AT&T Convention for instructions (opcode, source, destination)
– Used by Linux (Windows uses the Intel convention, where source and destination are reversed)
• 32 bit Addressing
• Little endian!
– LESS significant bytes go to LOWER addresses
• Example for word 0x90AB12CD
Memory Address   Saved Byte
1003             90
1002             AB
1001             12
1000             CD
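The byte layout in the table can be reproduced with Python's struct module, where "<I" means a little-endian 32-bit unsigned word. This is only an illustrative sketch; BASE reuses the address 1000 from the table.

```python
import struct

word = 0x90AB12CD
raw = struct.pack("<I", word)   # little endian: least significant byte first

BASE = 1000                     # lowest address of the word, as in the table
layout = {BASE + offset: byte for offset, byte in enumerate(raw)}

# Print from the highest address down, like the table does.
for address in sorted(layout, reverse=True):
    print(address, format(layout[address], "02X"))
```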
Memory Addressing
• Be VERY careful with little endianness
• It can be confusing at times!
• Whenever a pointer refers to a memory block…
• …it will ALWAYS point to the LOWEST address of the block
Memory Address   Saved Byte
0xbfff0007       0x00   (End of Second Block)
0xbfff0006       0x00
0xbfff0005       0x00
0xbfff0004       0x7B   (Start of Second Block)
0xbfff0003       0x00   (End of First Block)
0xbfff0002       0x33
0xbfff0001       0x32
0xbfff0000       0x31   (Start of First Block)
Consider two addresses: 0xbfff0000 and 0xbfff0004. In the block pointed to by the first address, you save the ARRAY ‘123’ (which is represented, in hex, by 0x31 0x32 0x33), whilst in the second block you save the NUMBER 123 (represented by 0x7b).
EACH WORD IS ALWAYS READ BY CONSIDERING 4 BYTES FROM THE BLOCK START
0xbfff0000: 0x00333231
0xbfff0004: 0x0000007B
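The two word values above can be checked with a short sketch that builds the same eight bytes (the string '123' with its terminating zero byte, followed by the number 123) and reads them back as two little-endian words:

```python
import struct

# First block: the characters '1', '2', '3' plus a terminating zero byte.
# Second block: the number 123 stored as a 32-bit little-endian word.
memory = b"123\x00" + struct.pack("<I", 123)

first_word, second_word = struct.unpack("<II", memory)
print(f"0x{first_word:08x}")
print(f"0x{second_word:08X}")
```

The same bytes 0x31 0x32 0x33 appear "reversed" in the first word only because a word is displayed starting from its most significant byte.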
Assembly X86 – Registers and Instructions
• 8 «General purpose» registers + 1 that points to the next instruction (we are only going to consider the ones used by our example!)
– EAX, EDX: «Accumulator» registers
– ESP: Stack Pointer
– EBP: Pointer to the stackframe base (when a function is called)
– EIP: Pointer to the next instruction
• Basic Instructions
– PUSH: Pushes a word onto the stack
– POP: Removes a word from the stack
– MOV: Moves a value from register to register or from register to memory
– MOVL: Moves a 4 byte word from a register to memory (and vice versa)
– AND: Logical AND operation
– ADD/SUB: Adds/subtracts a value to/from a register
– LEAVE: Completes some operations on the stack (see next slides)
– RET: Same as return
– CALL: Calls a function
– NOP: Does not execute anything
– The operation xchg %ax, %ax can be considered similar to a NOP (but we will not add details)
DISCLAIMER
REGISTER VALUES (ebp, esp, eax…) CAN VARY DEPENDING ON THE ARCHITECTURE AND ON THE OPERATING SYSTEM. IN THESE SLIDES YOU WILL ONLY FIND AN EXAMPLE TAKEN FROM AN EXECUTION OF THE FILE IN A VIRTUAL ENVIRONMENT
First Look
• Let’s have a look at the .text section retrieved with objdump
• There are a lot of functions and instructions
• A C program starts (in its source code) from the function main
– Let’s look for it!
• What can we “intuitively” grasp from this function?
• The first thing we can look for is calls to other functions
– Let’s then look for «call» instructions
• We see that three functions are actively called: sum, printf, puts
– puts is like printf without formatting. It is often used to print newlines
• Therefore, our program calls a function called ‘sum’ and prints something, along with a newline!
Static Analysis of Code - main
Stackframe (main function)
ESP
Starting situation
Each «block» is composed of 4 bytes
main is always called by a routine called _start (a compiled program does not start from main…)
Before calling a new function, the caller pushes onto the stack the return address from which the program resumes its flow
esp = 0xbffff07c
_start Return ADDRESS
Static Analysis of Code - main
Stackframe (main function)
ESP
push %ebp
The old stackframe base pointer is saved
PUSH FIRST MOVES THE POINTER BY 4 BYTES, THEN IT WRITES THE ELEMENT!
EBP
esp = 0xbffff078
_start Return ADDRESS
Static Analysis of Code - main
Stackframe (main function)
ESP
push %ebp
mov %esp, %ebp
Now the current stackframe base pointer points to the base of the main stackframe (before, the pointer was located in the _start function)
EBP
EBP
_start Return ADDRESS
esp = 0xbffff078
ebp = 0xbffff078
Static Analysis of Code – Memory Allocation
Stackframe (main function)
ESP
push %ebp
mov %esp, %ebp
and $0xfffffff0, %esp
We are preparing the program to make room for local variables and parameters
This instruction moves ESP to a location whose address is a multiple of 16 (here, 0xbffff078 AND 0xfffffff0 = 0xbffff070). Intel processors feature special instructions which require that ESP stays at a memory address that is a multiple of 16 after the space for variables has been prepared. This preliminary instruction ensures that, when the space is completely ready, ESP points to an address that is a multiple of 16 (see next slide)
EBP
EBP
esp = 0xbffff070
ebp = 0xbffff078
_start Return ADDRESS
Static Analysis of Code – Memory Allocation
Stackframe (main function)
ESP
push %ebp
mov %esp, %ebp
and $0xfffffff0, %esp
sub $0x20, %esp
ESP moves 32 bytes (0x20) down the stack in order to make room for local variables, as well as for the parameters of another function
We are decreasing by a multiple of 16 (see previous slide)
PUSH+MOV+(AND)+SUB -> This is typically done when a function wants to call another one!
EBP
EBP
esp = 0xbffff050
ebp = 0xbffff078
_start Return ADDRESS
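The four prologue instructions can be replayed as plain arithmetic on ESP and EBP. The sketch below is a toy model, not an emulator. It assumes ESP is 0xbffff07c right before push %ebp (four bytes above the 0xbffff078 shown after the push), and the caller's EBP value of 0 is just a placeholder. The function name run_main_prologue is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def run_main_prologue(esp, ebp):
    """Replay push/mov/and/sub on 32-bit ESP and EBP; stack maps address -> word."""
    stack = {}
    esp = (esp - 4) & MASK32      # push %ebp: first move ESP down by 4 bytes...
    stack[esp] = ebp              # ...then write the old EBP there
    ebp = esp                     # mov %esp, %ebp
    esp &= 0xFFFFFFF0             # and $0xfffffff0, %esp: align down to 16
    esp = (esp - 0x20) & MASK32   # sub $0x20, %esp: reserve 32 bytes
    return esp, ebp, stack


# Assumed starting values (see the lead-in): ESP before the push, placeholder EBP.
esp, ebp, stack = run_main_prologue(esp=0xBFFFF07C, ebp=0x0)
print(hex(esp), hex(ebp))
```

With these assumptions the final values match the ones on the slide: esp = 0xbffff050 and ebp = 0xbffff078.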
Static Analysis of Code – Function Call
Stackframe (main function)
ESP
push %ebp
mov %esp, %ebp
and $0xfffffff0, %esp
sub $0x20, %esp
movl $0x5, 0x4(%esp)
After some space has been reserved, when a new function is called the caller starts storing the new function parameters (they ALWAYS go at the end of the stackframe in reverse order -> the first parameter ALWAYS goes to the bottom).
EBP
EBP
esp = 0xbffff050
ebp = 0xbffff078
5
_start Return ADDRESS
Static Analysis of Code – Function Call
Stackframe (main function)
ESP
push %ebp
mov %esp, %ebp
and $0xfffffff0, %esp
sub $0x20, %esp
movl $0x5, 0x4(%esp)
movl $0x4, (%esp)
The second value to be stored (the first parameter in the C code) is written to the stack. The function takes two parameters whose values are 4 and 5 -> func(4, 5)
EBP
EBP
esp = 0xbffff050
ebp = 0xbffff078
5
4
_start Return ADDRESS
Static Analysis of Code – Function Call
Stackframe (main and sum functions)
ESP
push %ebp
mov %esp, %ebp
and $0xfffffff0, %esp
sub $0x20, %esp
movl $0x5, 0x4(%esp)
movl $0x4, (%esp)
call 804844d <sum>
Calling a new function means saving on the stack the return address (that is, the address of the next instruction of the main function) and jumping to the beginning of the new function
5
4
RETURN ADDRESS
esp = 0xbffff04c
ebp = 0xbffff078
main stackframe
sum stackframe (NOTE: even if conceptually the passed parameters are part of the new function, it is common to consider the return address as the start of the new stackframe)
Static Analysis of Code – Sum Function
Stackframe (main and sum functions)
ESP
...
movl $0x5, 0x4(%esp)
movl $0x4, (%esp)
call 804844d <sum>
5
4
RETURN ADDRESS
EBP
EBP main()
esp = 0xbffff048
ebp = 0xbffff048
main stackframe
sum stackframe
push %ebp
mov %esp, %ebp
....
The new function always saves in its stackframe the EBP of the caller (in this case, the EBP of main is saved)
Static Analysis of Code – Sum Function
Stackframe (main and sum functions)
ESP
...
movl $0x5, 0x4(%esp)
movl $0x4, (%esp)
call 804844d <sum>
5
4
RETURN ADDRESS
esp = 0xbffff04c
ebp = 0xbffff078
main stackframe
sum stackframe
push %ebp
mov %esp, %ebp
....
leave
leave completely cleans up the stackframe of the function that is returning and restores the caller's ebp
Static Analysis of Code – Sum Function
Stackframe (main and sum functions)
ESP
...
movl $0x5, 0x4(%esp)
movl $0x4, (%esp)
call 804844d <sum>
5
4
esp = 0xbffff050
ebp = 0xbffff078
main stackframe
sum stackframe
push %ebp
mov %esp, %ebp
....
leave
ret
ret loads the return address into eip (the next instruction register) by using a POP instruction (the opposite of PUSH: it removes the element from the stack and moves ESP 4 bytes back)
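The whole round trip (call, prologue of sum, leave, ret) can be traced with the register values used in these slides. The following is a simplified model that only tracks ESP, EBP and the words written to the stack. It expands leave into "mov %ebp, %esp" followed by "pop %ebp", which is its standard meaning, and the function name trace_sum_call is hypothetical.

```python
def trace_sum_call():
    """Follow ESP/EBP through call, push/mov, leave and ret."""
    esp, ebp = 0xBFFFF050, 0xBFFFF078   # main's registers right before the call
    stack = {}
    trace = {}

    # call 804844d <sum>: push the address of the next instruction of main
    esp -= 4
    stack[esp] = 0x08048480
    trace["call"] = (esp, ebp)

    # push %ebp ; mov %esp, %ebp (prologue of sum)
    esp -= 4
    stack[esp] = ebp
    ebp = esp
    trace["prologue"] = (esp, ebp)

    # leave: mov %ebp, %esp ; pop %ebp
    esp = ebp
    ebp = stack[esp]
    esp += 4
    trace["leave"] = (esp, ebp)

    # ret: pop the return address into EIP
    eip = stack[esp]
    esp += 4
    trace["ret"] = (esp, ebp)
    return trace, eip


trace, eip = trace_sum_call()
for step, (esp_value, ebp_value) in trace.items():
    print(step, hex(esp_value), hex(ebp_value))
print("eip", hex(eip))
```

In this model, after ret the registers are back to the values main had before the call, and EIP holds the address of the instruction that follows the call.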
Static Analysis of Code – Sum Function
Stackframe (main function)
ESP
push %ebp
mov %esp, %ebp
and $0xfffffff0, %esp
sub $0x20, %esp
movl $0x5, 0x4(%esp)
movl $0x4, (%esp)
call 804844d <sum>
mov %eax, 0x1c(%esp)
%eax contains the result of the sum function, which is stored below the location pointed to by EBP. The location should be EBP-4, but the alignment instruction (the and, shown in blue on the slide) moves everything by a further 8 bytes
EBP
EBP
esp = 0xbffff050
ebp = 0xbffff078
5
4
9
_start Return ADDRESS
Static Analysis of Code – Calling printf
Stackframe (main function)
ESP
push %ebp
mov %esp, %ebp
and $0xfffffff0, %esp
sub $0x20, %esp
movl $0x5, 0x4(%esp)
movl $0x4, (%esp)
call 804844d <sum>
mov %eax, 0x1c(%esp)
mov %eax, 0x4(%esp)
Load parameters for the next call
EBP
EBP
esp = 0xbffff050
ebp = 0xbffff078
9
4
9
_start Return ADDRESS
Static Analysis of the Code – Calling printf
Stackframe (main function)
ESP
push %ebp
mov %esp, %ebp
and $0xfffffff0, %esp
sub $0x20, %esp
movl $0x5, 0x4(%esp)
movl $0x4, (%esp)
call 804844d <sum>
mov %eax, 0x1c(%esp)
mov %eax, 0x4(%esp)
movl $0x8048540, (%esp)
This address refers to a string (strings are stored in dedicated sections of the file)
EBP
EBP
esp = 0xbffff050
ebp = 0xbffff078
9
0x8048540
9
_start Return ADDRESS
Static Analysis of the Code – Calling printf
Stackframe (main function)
ESP
push %ebp
mov %esp, %ebp
and $0xfffffff0, %esp
sub $0x20, %esp
movl $0x5, 0x4(%esp)
movl $0x4, (%esp)
call 804844d <sum>
mov %eax, 0x1c(%esp)
mov %eax, 0x4(%esp)
movl $0x8048540, (%esp)
call 8048310 <printf@plt>
...
Calls printf (which takes as parameters a string and a value to print)
EBP
EBP
esp = 0xbffff050
ebp = 0xbffff078
9
0x08048540
9
_start Return ADDRESS
Further notes…
• To fully understand the solution of the challenge, you also have to analyze the “sum” function…
• The principle is the same as for the main function (even simpler!)
• Can you find the solution to the challenge by using static analysis?
• Additional question: can you guess the solution of the challenge by only inspecting the main function?
Dynamic Analysis
Dynamic Analysis
• Static analysis provides valuable information on the executable
• However, this is often not enough!
– Some information is only available at runtime…
– Understanding the register values by only using static analysis might be too complex!
– The executable may be obfuscated to complicate static analysis…
• To cope with these problems, dynamic analysis can be really helpful
• Dynamic Analysis monitors the execution of the program, making it possible to analyze memory, instructions and the program flow at runtime
Introduction to GDB
• GDB = GNU Debugger
• It’s the most popular open source program to analyze x86/x64 executables
• Works on Linux, Windows, OSX
• A lot of functionality!
• Allows you to stop the execution of the program at a specific instruction (breakpoints)
– You can analyze memory and registers
• It also allows you to set up conditional breakpoints, which trigger only when certain conditions occur
• GDB allows you to spot bugs in a program (or to exploit them to our advantage)
Using GDB
• Let’s go back to sum_number
• gdb sum_number
• Type run to execute the program
• From objdump, we see that the function sum starts from 0x804844d
• With the x/i command you can see the instruction at a given address
• If you type x/i 0x804844d you can see the instruction at that address
• Let’s see what happens inside the function «sum»
• The address of the first instruction is 0x0804844d
• break *0x0804844d
– DO NOT FORGET THE ASTERISK
• run
• The execution is stopped BEFORE RUNNING THE INSTRUCTION
• (Warning: the following slides will continue the execution, so do not stop the execution)
X Command
• Very powerful command!
• x shows the content of the memory according to a certain type of representation (for instance, you can represent a sequence of bytes as instructions or keep them as bytes)
• If you type x/ni you can see n instructions from a specific address...
• If you type x/nb you can visualize n bytes starting from the lowest part of the block...
– Example: x/4b $ebp+4 shows the return address of the function (an address inside the caller of «sum»), given by: 0x08048480
– This is how it appears: 0x80 0x84 0x04 0x08 (THE LEAST SIGNIFICANT BYTE IS ON THE LEFT)
• It is more effective to visualize data as words
– x/w $ebp+4 shows the same result as a word, starting from the most significant byte
• DO NOT USE x ALONE (without slash), as it will use the viewing style of its last call
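The difference between the byte view and the word view of the same four bytes can be reproduced outside the debugger. A minimal sketch using the return address quoted above:

```python
return_address = 0x08048480

# Byte view: bytes listed from the lowest address, so the least significant comes first.
as_bytes = return_address.to_bytes(4, "little")
byte_view = " ".join(f"0x{b:02x}" for b in as_bytes)

# Word view: the same four bytes interpreted as one little-endian word.
word_view = f"0x{int.from_bytes(as_bytes, 'little'):08x}"

print(byte_view)
print(word_view)
```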
Memory Analysis with GDB
• We can obtain information on the loaded stackframes
• Type frame
• There is only one available frame (the one of «sum»)
• Select the frame with f 0
• info f
– Shows all the information on the needed registers
– Shows the current ebp, the previous ebp and the return address (saved eip)
• Let’s see the register contents!
• info registers ebp
– Shows the value of EBP (EBP HAS NOT BEEN UPDATED YET, SO YOU STILL SEE THE PREVIOUS VALUE)
• info registers esp
– Current pointer to the stack
Memory Analysis with GDB
• GDB can show the memory content
• Let’s see what we can find in the location pointed to by esp
• info registers esp returns «0xbffff04c»
• So type x/w 0xbffff04c
• The result is «0x08048480», which is the address of the instruction after the call to sum
– esp is therefore pointing to the location that contains the return address
– This is correct, as we still have to push ebp onto the stack
• Let’s go on, instruction by instruction
– Use the command ni
• Use it now three times
– Sometimes, it is possible to see instructions with some references to the original variables used by the programmer
– This is because the program contains «debugging» information
Memory Analysis with GDB (2)
• The sum function sums two parameters and stores the result in a new variable
• The sum parameters are loaded into the eax and edx registers
– «mov 0xc(%ebp), %eax», «mov 0x8(%ebp), %edx»
• How to retrieve the values stored in eax and edx?
• First way:
– info registers ebp -> 0xbffff048
– x/b 0xbffff048+(0xc) -> YOU CAN ALSO READ MEMORY AT SPECIFIC OFFSETS! -> You get 5 (the SECOND parameter)
– x/b 0xbffff048+(0x8) -> You get 4 (the FIRST parameter)
• Second way:
– Type ni two times
– info registers eax, info registers edx
• Third way:
– print a and print b, as the function takes a and b as input
– Works ONLY if debugging information is available...
Summing up …
• You learnt many things from this lecture
• The structure of Linux executables
• How Linux executables are loaded into memory
• How to analyze a Linux executable, by using the fundamentals of x86 assembly and two analysis techniques:
– Static analysis
– Dynamic analysis
• Next question is: what if an attacker is able to exploit such information to their advantage?
• Stay tuned for the next lesson!