DIGITAL FORENSIC RESEARCH CONFERENCE
Automatic Classification of Object Code
Using Machine Learning
By
John Clemens
Presented At
The Digital Forensic Research Conference
DFRWS 2015 USA Philadelphia, PA (Aug 9th - 13th)
DFRWS is dedicated to the sharing of knowledge and ideas about digital forensics research. Ever since it organized
the first open workshop devoted to digital forensics in 2001, DFRWS continues to bring academics and practitioners
together in an informal environment. As a non-profit, volunteer organization, DFRWS sponsors technical working
groups, annual conferences and challenges to help drive the direction of research and development.
http://dfrws.org
Automatic Classification of Object Code Using Machine Learning
Architecture and Endianness
John Clemens
University of Maryland, Baltimore County (UMBC), Baltimore, Maryland, clemej1 at umbc.edu
Johns Hopkins University Applied Physics Laboratory (JHUAPL), Laurel, Maryland, john.clemens at jhuapl.edu
DFRWS 2015
J. Clemens (UMBC / JHUAPL) Classification of Object Code DFRWS 2015 1 / 14
Motivation
Reverse engineering and malware analysis efforts are extremely labor intensive and require expert domain knowledge
Enterprise computing environments and networks are diverse
  - Types of systems: Laptops, Phones, Routers, IoT ...
  - Within each individual system
Reverse engineering efforts are often "black box" tasks where nothing is known about the underlying system
Analysts are looking for tools and techniques to jumpstart this analysis
Motivation
Machine learning using byte histograms has already proven useful to classify general file types / file fragments (McDaniel and Heydari, 2003, among many others).
Can we do better, and start to categorize within these general types?
Start with compiled object code
Very important for both malware analysis and reverse engineering
Enterprise diversity means that the analyst will likely encounter multiple types of object code
Accurate disassembly is crucial
Classification targets: Architecture and Endianness
Motivation
Computers are more than just the CPU
Example: My Thinkpad T400 (2008)
  - Intel Core 2 Duo P8400
  - Intel GMA4500
  - ATI Radeon HD 3400
  - Broadcom Wifi
  - Intel NIC
  - ACPI Embedded Controller
  - SSD Controller
  - ....
Example: My Motorola G (2015)
  - Qualcomm ARM Cortex-A53
  - Adreno 405
  - Cell Baseband processor
  - Battery controller
  - ....
Image credit: Wikipedia
Motivation
Malware authors are using this diversity to hide their software where existing countermeasures can't reach
"..but the most interesting finding is the malwares ability to reprogram the victims hard drives, making their implants invisible and almost indestructible..." - Kaspersky, Feb 17 2015
"..The malware they created, called BadUSB, can be installed on a USB device to completely take over a PC, invisibly alter files installed from the memory stick, or even redirect the users internet traffic..." - Wired, July 2014
"..Developers have published two pieces of malware [Jellyfish rootkit and Demon keylogger] that take the highly unusual step of completely running on an infected computer's graphics card, rather than its CPU..." - Ars Technica, May 7, 2015
Existing Methods
Existing methods either rely on file metadata to determine architecture, look for existing signatures within the sample, or just try to disassemble it and leave it up to the analyst to determine if the output is valid. Binwalk uses the last two methods.
File metadata can be either missing or misleading
  - Unadorned firmware blobs
  - Obfuscated/packed malware
  - Incomplete file fragments or packet traces
  - Embedded code in native object files
Signature detection requires prior knowledge of the architecture and can lead to false positives
Disassembly methods can be misleading and require tools that support the architecture
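The metadata-based approach described above can be sketched in a few lines. This is a minimal illustration that assumes the sample is an ELF file; the function name `elf_metadata` and the partial machine-name table are hypothetical and not part of the original work. It reads word size, byte order, and machine type from the ELF header, and returns None when no header is present, which is the situation for an unadorned firmware blob or a file fragment.

```python
import struct

# Small, partial mapping of ELF e_machine values to names (illustrative subset).
EM_NAMES = {2: "sparc", 3: "i386", 4: "m68k", 8: "mips", 20: "powerpc",
            40: "arm", 62: "amd64", 183: "arm64"}


def elf_metadata(blob: bytes):
    """Return (word_size_bits, endianness, machine) from an ELF header, or None."""
    # The header starts with a 4-byte magic; e_machine ends at byte offset 20.
    if len(blob) < 20 or blob[:4] != b"\x7fELF":
        return None
    ei_class, ei_data = blob[4], blob[5]
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        return None
    word_size = 32 if ei_class == 1 else 64
    endianness = "little" if ei_data == 1 else "big"
    # e_machine is a 16-bit field at offset 18, stored in the file's byte order.
    fmt = "<H" if ei_data == 1 else ">H"
    (e_machine,) = struct.unpack_from(fmt, blob, 18)
    return word_size, endianness, EM_NAMES.get(e_machine, f"unknown({e_machine})")


# Hand-built header for a 32-bit big-endian MIPS object. Fields after
# e_machine are omitted because this sketch does not read them.
header = b"\x7fELF" + bytes([1, 2, 1]) + bytes(9) + struct.pack(">HH", 2, 8)
print(elf_metadata(header))            # header present: metadata is available
print(elf_metadata(b"\x00\x01" * 32))  # raw blob: nothing to rely on
```

Everything this sketch reports comes from fields that an author can strip or falsify, which is the weakness the slide points out.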
Proposed Method
Hypothesis
Machine learning techniques can be applied to object code directly (without including metadata) to automatically classify the object code's target architecture and endianness
Dataset
Characteristics
  - 16,785 unique ELF files (exes/libs)
  - 20 architectures
  - Sources include Debian package repositories, Arduino development kits, and CUDA libraries
Problems
  - Only one compiler (GCC)
  - CUDA sample size
  - Need more embedded/micro-controller samples
  - Heavy bias towards RISC

Architecture   # Samples   Word size   Endianness
alpha          1,383       64-bit      Big
hppa           625         32-bit      Big
m68k           1,296       32-bit      Big
arm64          1,134       64-bit      Little
ppc64          823         64-bit      Big
sh4            822         32-bit      Little
sparc64        752         64-bit      Big
amd64          965         64-bit      Little
armel          960         32-bit      Little
armhf          960         32-bit      Little
i386           967         32-bit      Little
ia64           650         64-bit      Little
mips           960         32-bit      Big
mipsel         960         32-bit      Little
powerpc        992         32-bit      Big
s390           649         32-bit      Big
s390x          653         64-bit      Big
sparc          648         32-bit      Big
cuda           17          32-bit      Little
avr            596         8-bit       Little
Total          16,785
Intuition
Can you tell which architectures the above two samples target?
Instructions contain two parts:
opcode: Unique to the architecture
operands: Data to be operated on
Opcode Density
Opcode Density = (length of opcode) / (average instruction length)
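A short worked example may make the ratio concrete. The sketch below is illustrative only: the function name `opcode_density` and both encodings are hypothetical, and the numbers are not measurements of any real architecture.

```python
def opcode_density(opcode_length: float, instruction_lengths: list) -> float:
    """Opcode length divided by the average instruction length (same units)."""
    if not instruction_lengths:
        raise ValueError("need at least one instruction length")
    average = sum(instruction_lengths) / len(instruction_lengths)
    return opcode_length / average


# Hypothetical fixed-width encoding: 1-byte opcode in 4-byte instructions.
fixed = opcode_density(1, [4, 4, 4, 4])
# Hypothetical variable-width encoding: 1-byte opcode, 1 to 3 byte instructions.
variable = opcode_density(1, [1, 2, 3, 2])
print(fixed, variable)
```

A higher ratio means a larger share of the code bytes are architecture-specific opcode bytes rather than operand data.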
Feature Vector
To classify Architecture:
Extract code sections of each ELF object
Generate a normalized byte histogram to become a 256-entry attribute vector
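The histogram step can be sketched as follows. This is a minimal illustration, assuming the code section has already been extracted into a bytes object and that "normalized" means dividing each count by the section length; the slide does not specify the normalization, and the function name `byte_histogram` is hypothetical.

```python
from collections import Counter


def byte_histogram(code: bytes) -> list:
    """Return a 256-entry list of relative byte-value frequencies."""
    if not code:
        return [0.0] * 256
    counts = Counter(code)
    total = len(code)
    # Assumption: normalize by total length so the entries sum to 1.
    return [counts.get(value, 0) / total for value in range(256)]


sample = bytes([0x00, 0x00, 0x90, 0xFF])  # stand-in for an extracted code section
hist = byte_histogram(sample)
print(len(hist), hist[0x00], hist[0x90], hist[0xFF])
```

Normalizing by length keeps samples of different sizes comparable, since only the relative frequency of each byte value remains.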
To classify Endianness:
Above is insufficient; endianness requires adjacency information lost in a byte histogram
Look for distinctive patterns: encoding of '1' and '-1'
Count the number of occurrences of '0x0001', '0x0100', '0xfffe', '0xfeff'
Use these four counts as additional endianness attributes
Experiments with 2-byte bi-grams proved resource intensive and performed poorly compared to this method
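The four endianness attributes can be sketched like this. It is a minimal illustration, assuming the patterns are counted at every byte offset of the code section, which the slide does not specify; the names `ENDIAN_PATTERNS` and `endian_counts` and the two toy samples are hypothetical.

```python
# Byte sequences for 0x0001, 0x0100, 0xfffe and 0xfeff as they appear in a file.
ENDIAN_PATTERNS = (b"\x00\x01", b"\x01\x00", b"\xff\xfe", b"\xfe\xff")


def endian_counts(code: bytes) -> list:
    """Count occurrences of each two-byte pattern at any byte offset."""
    # None of these patterns can overlap itself, so bytes.count (which counts
    # non-overlapping matches) finds every occurrence.
    return [code.count(pattern) for pattern in ENDIAN_PATTERNS]


# Toy samples: small 32-bit values stored big-endian and little-endian.
big_like = b"\x00\x00\x00\x01" * 3 + b"\xff\xff\xff\xfe"
little_like = b"\x01\x00\x00\x00" * 3 + b"\xfe\xff\xff\xff"
print(endian_counts(big_like))
print(endian_counts(little_like))
```

In the toy data the first and third counts dominate for the big-endian sample and the second and fourth for the little-endian one. Appending these four values to the 256-entry histogram gives a 260-entry attribute vector.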
Results: Overall Performance
Trained Model         Multi-class Strategy   WEKA Name              Histogram   Hist + Endian
1-NN                  Inherent               IBk                    89.3238%    92.7256%
3-NN                  Inherent               IBk                    89.8660%    94.9002%
Decision Tree         Inherent               J48                    93.2976%    98.0697%
Random Tree           Inherent               RandomTree             87.8046%    92.9461%
Random Forest         Inherent               RandomForest           90.4617%    96.4373%
Naive Bayes           Inherent               NaiveBayes             92.5827%    95.8951%
BayesNet              Inherent               BayesNet               89.5144%    92.2252%
SVM (SMO)             1-vs-1                 SMO                    92.7256%    98.3497%
Logistic Regression   Inherent               SimpleLogistic         93.0831%    97.9386%
Neural Net            Inherent               MultilayerPerceptron   94.0244%    97.9565%

10-fold stratified cross validation accuracy for various models using the byte-value histogram alone, and the byte-value histogram augmented with heuristic-based endianness attributes.
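For readers unfamiliar with the evaluation procedure, stratified k-fold cross validation can be sketched in plain Python. This is not the WEKA setup used for the table: the 1-nearest-neighbor classifier is a stand-in, the data is synthetic, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, balancing every class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def nearest_neighbor(train_x, train_y, query):
    """Predict the label of the closest training vector (squared Euclidean)."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)))
    return train_y[best]


def cross_validate(features, labels, k=10, seed=0):
    """Return the fraction of samples classified correctly across all k folds."""
    folds = stratified_folds(labels, k, seed)
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        train_x = [features[i] for i in train]
        train_y = [labels[i] for i in train]
        correct += sum(nearest_neighbor(train_x, train_y, features[i]) == labels[i]
                       for i in held_out)
    return correct / len(labels)


# Toy data: two well-separated synthetic classes, 20 samples each.
toy_rng = random.Random(1)
features = [[toy_rng.random(), toy_rng.random()] for _ in range(20)]
features += [[5 + toy_rng.random(), 5 + toy_rng.random()] for _ in range(20)]
labels = ["a"] * 20 + ["b"] * 20
print(cross_validate(features, labels, k=10))
```

Stratification keeps the class proportions of each fold close to those of the whole dataset, which matters when some classes are small, such as the 17 CUDA samples.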
Results: Effect of Sample Size
Summary
Contributions of this work include:
A dataset of 16K samples of object code from 20 architectures
Machine learning techniques using byte histograms can automatically classify object code's target architecture
Endianness determined with high accuracy with the addition of four extra heuristics in the feature vector
High accuracy with a small sample size
A method of automatic object code classification that does not require signatures, toolchain support, correct metadata, or any previous knowledge
Future Directions / Questions?
Future Work
  - More complete dataset (send me your esoteric samples!)
      * Microcontroller / Embedded samples
      * Virtual byte code (e.g. Python, Java, Dalvik/ART, CIL)
  - Word-size detection
  - Compiler attribution
  - Detect embedded code (Thumb vs. ARM)
  - Function / basic block boundary detection
  - Plugin (IDA and friends)
I will post original dataset and updated dataset for download (location TBD)
Questions?