Discoverer: Automatic Protocol Reverse Engineering
from Network Traces
Weidong Cui
Jayanthkumar Kannan
Helen J. Wang
Microsoft Research
USENIX Security (Security ’07)
Presented by Mike Hsiao, 2008-01-25
Outline
1. Introduction
2. Problem Statement: common protocol idioms and the scope of Discoverer
3. Design
4. Evaluation
5. Related Work*
6. Limitations and Future Work
7. Conclusion and Comment
Application-level protocol specifications: usage
Application-level protocol specifications are useful for many security applications:
– intrusion prevention and detection
– deep packet inspection
– protocol analyzers
– penetration testing, which generates network inputs to an application to uncover potential vulnerabilities
Current practice for obtaining these specifications is mostly manual.
Section 1
Discoverer
is a tool for automatically reverse engineering the protocol message formats of an application from its network trace
operates in a protocol-independent fashion
– by inferring protocol idioms commonly seen in the message formats of many application-level protocols
is evaluated on one text protocol (HTTP) and two binary protocols (RPC and CIFS/SMB)
Section 1
Application-level protocol specifications
Obtained from documentation or reverse engineered manually
Manual reverse engineering is time-consuming and error-prone
“It took the open-source SAMBA project 12 years to manually reverse engineer the Microsoft SMB protocol.”
“Yahoo messenger protocol has also been persistently reverse engineered, despite which, the open source clients regularly require patching to support proprietary changes in the Yahoo protocol.”
– the period between the availability of an official client and an open-source client has been a month
Section 1
Automatically reverse engineer message formats
Challenges
– Very few hints from the network trace (byte streams)
– Protocols are significantly different from each other
– Protocol message formats are often context-sensitive, where earlier fields dictate the parsing of the subsequent part of the message
The authors dissect the formless byte streams into text and binary segments (tokens) as a starting point for clustering messages with similar patterns, where each cluster approximates a message format.
Section 1
Evaluation Metrics
Correctness
– does one inferred format correspond to exactly one true format?
Conciseness
– how many inferred formats is a single true format reflected in?
Coverage
– how many messages are covered by the inferred formats?
Section 1
Problem Statement: Common Protocol Idioms
Application session
– consists of a series of messages between two hosts that accomplishes a specific task.
Message format specification
– a sequence of fields and their semantics
Field semantics
– length
– offset (byte offset of another field)
– pointer (an offset that specifies the index of a field)
– cookie (session-specific opaque data, e.g., a session ID)
– endpoint-address (IP, port)
– set (a group of fields that can be put in an arbitrary order)
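One way to picture a message format specification is as a small data structure. The sketch below is only an illustration: the class names `Semantics`, `Field`, and `MessageFormat`, and the example field names, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Semantics(Enum):
    """Field semantics listed on the slide."""
    LENGTH = "length"
    OFFSET = "offset"
    POINTER = "pointer"
    COOKIE = "cookie"
    ENDPOINT_ADDRESS = "endpoint-address"
    SET = "set"


@dataclass
class Field:
    name: str                              # hypothetical label, for readability only
    is_text: bool                          # text field vs. binary field
    semantics: Optional[Semantics] = None  # None when no semantics is known


@dataclass
class MessageFormat:
    fields: List[Field]                    # a format is an ordered sequence of fields


# A made-up format: a type byte, a length field, a session cookie, and a payload.
example = MessageFormat([
    Field("msg_type", is_text=False),
    Field("body_len", is_text=False, semantics=Semantics.LENGTH),
    Field("session_id", is_text=False, semantics=Semantics.COOKIE),
    Field("payload", is_text=True),
])
```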
Section 2
Common Protocol Idioms: Format Distinguisher
Format Distinguisher (FD)
– It serves to differentiate the format of the subsequent part of the message.
– A message may have a sequence of FD fields, particularly when multiple protocols are encapsulated. E.g., an SMB message starts with a NetBIOS header.
– This implies that applications need to scan a message from left to right, decoding an FD field before parsing the subsequent part of the message.
Section 2
Scope of Discoverer
derive the message format specification
– not the protocol finite state machine
assume synchronous protocols
– A message is a consecutive chunk of application-level data sent in one direction
– Traffic runs over one or more TCP or UDP connections
– A UDP connection is a pair of unidirectional UDP flows
focus on applications that do not obfuscate payloads
do not capture timing semantics
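The message definition above (a consecutive chunk of data sent in one direction) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `(direction, payload)` packet tuples and the labels `"c2s"` / `"s2c"` are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical packet record: (direction, payload), direction is "c2s" or "s2c".
Packet = Tuple[str, bytes]


def split_into_messages(packets: List[Packet]) -> List[Packet]:
    """Merge consecutive same-direction payloads of one connection into messages."""
    messages: List[Packet] = []
    for direction, payload in packets:
        if not payload:
            continue  # ignore packets that carry no application data
        if messages and messages[-1][0] == direction:
            # Same direction as the previous chunk: extend the current message.
            prev_dir, prev_data = messages[-1]
            messages[-1] = (prev_dir, prev_data + payload)
        else:
            # Direction changed: a new message starts here.
            messages.append((direction, payload))
    return messages


packets = [
    ("c2s", b"GET / "),
    ("c2s", b"HTTP/1.1\r\n\r\n"),
    ("s2c", b"HTTP/1.1 200 OK\r\n\r\n"),
]
messages = split_into_messages(packets)
```

This only works under the synchronous assumption: if the two directions interleave mid-message, the direction change would split one message into several.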
Section 2
Design: Overview
Cluster messages with the same format together and infer the message format by comparing messages in a single cluster
1. Tokenization and Initial Clustering
2. Recursive Clustering
3. Merging
Section 3
1-1 Tokenization (1/2)
Text
– Identify text bytes by comparing them with the ASCII values of printable characters
– Consider a sequence of text bytes sandwiched between two binary bytes as a text segment
– Require the sequence to have a minimum length
– Use a set of delimiters (e.g., space and tab) to divide a text segment into tokens
Section 3
1-1 Tokenization (2/2)
Binary
– They simply declare a single binary byte to be a binary token in its own right.
– Error 1: consecutive binary bytes with ASCII values of printable characters are wrongly marked as a text token
– Error 2: a text string shorter than the minimum length is wrongly marked as binary tokens
– Error 3: a text field containing white space characters is wrongly divided into multiple text tokens
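The tokenization rules from the two slides above can be sketched roughly as below. It is a simplified illustration under stated assumptions: `MIN_TEXT_LEN = 3` is a made-up threshold, "text bytes" are taken to be printable ASCII plus tab, and message boundaries are treated like binary bytes when delimiting a text segment.

```python
import re
from typing import List, Tuple

MIN_TEXT_LEN = 3           # assumed minimum text-segment length, not the paper's value
Token = Tuple[str, bytes]  # (class, value); class is "text" or "binary"

TEXT_RUN = re.compile(rb"[\x20-\x7e\t]+")  # assumed text bytes: printable ASCII and tab
DELIMS = re.compile(rb"[ \t]+")            # delimiters inside a text segment


def tokenize(message: bytes) -> List[Token]:
    tokens: List[Token] = []
    pos = 0  # first byte not yet emitted as a token
    for run in TEXT_RUN.finditer(message):
        if run.end() - run.start() < MIN_TEXT_LEN:
            continue  # too short: these bytes fall through to the binary case
        # Every byte before the text segment becomes its own binary token.
        tokens += [("binary", message[i:i + 1]) for i in range(pos, run.start())]
        # The text segment is divided into text tokens at the delimiters.
        tokens += [("text", word) for word in DELIMS.split(run.group()) if word]
        pos = run.end()
    # Remaining bytes after the last text segment are binary tokens.
    tokens += [("binary", message[i:i + 1]) for i in range(pos, len(message))]
    return tokens


tokens = tokenize(b"\x00\x01GET /index\x00ab\xff")
```

In this example the two-byte string `ab` is shorter than the assumed minimum length, so it ends up as two binary tokens, which is exactly the situation described as Error 2.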
Section 3
1-2 Initial Clustering by Token Patterns
The authors cluster messages based on their token patterns.
– The token pattern assigned to a message is a tuple: (dir, class of token 1, class of token 2, …)
– E.g., (client to server, text, binary, text)
Note that this initial clustering is coarse-grained, since messages with different formats may have the same token pattern.
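Grouping by token pattern amounts to using the pattern tuple as a dictionary key. The sketch below assumes each message is already tokenized into `(class, value)` pairs and tagged with a direction string; the function names and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Token = Tuple[str, bytes]          # (class, value)
Message = Tuple[str, List[Token]]  # (direction, tokens)


def token_pattern(direction: str, tokens: List[Token]) -> Tuple[str, ...]:
    """Build (dir, class of token 1, class of token 2, ...)."""
    return (direction,) + tuple(cls for cls, _ in tokens)


def initial_clusters(messages: List[Message]) -> Dict[Tuple[str, ...], List[List[Token]]]:
    clusters: Dict[Tuple[str, ...], List[List[Token]]] = defaultdict(list)
    for direction, tokens in messages:
        clusters[token_pattern(direction, tokens)].append(tokens)
    return dict(clusters)


sample = [
    ("c2s", [("text", b"GET"), ("text", b"/a")]),
    ("c2s", [("text", b"POST"), ("text", b"/b")]),  # different format, same pattern
    ("s2c", [("text", b"OK"), ("binary", b"\x00")]),
]
clusters = initial_clusters(sample)
```

The first two sample messages land in the same cluster even though they would plausibly have different formats, which illustrates why this step is only coarse-grained.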
Section 3
2 Recursive Clustering
The recursive clustering relies on identifying format distinguisher (FD) tokens
To find FD tokens, we need to invoke both format inference and format comparison
Section 3
2-1 Format Inference
This phase takes as input a set of messages and infers a format that succinctly captures the content of the set of messages.
Property Inference
– Token class is already identified during the tokenization phase.
– Constant or variable tokens can also be easily identified.
– Since the set of messages comes from a single token-pattern cluster, tokens in one message can be directly compared against their counterparts by simply using the token offset.
– Thus, constant tokens are those that take the same value across the entire set of messages, and variable tokens are those that take more than one value.
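The constant/variable rule can be sketched in a few lines. The example assumes every message in the cluster has the same number of tokens (true within one token-pattern cluster); the function name `infer_properties` is hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, bytes]  # (class, value)


def infer_properties(cluster: List[List[Token]]) -> List[Tuple[str, str]]:
    """Return (class, 'constant' | 'variable') for each token offset."""
    properties = []
    for offset in range(len(cluster[0])):
        # Compare the token at this offset across all messages in the cluster.
        values = {message[offset][1] for message in cluster}
        token_class = cluster[0][offset][0]
        kind = "constant" if len(values) == 1 else "variable"
        properties.append((token_class, kind))
    return properties


cluster = [
    [("text", b"GET"), ("text", b"/a"), ("text", b"HTTP/1.1")],
    [("text", b"GET"), ("text", b"/b"), ("text", b"HTTP/1.1")],
]
properties = infer_properties(cluster)
```

A token that happens to take only one value in the trace is reported as constant here, even if the real protocol lets it vary.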
Section 3
2-1 Format Inference
Semantic Inference
– length: for a specific pair of messages, the difference in the values of potential length fields reflects the difference in the sizes of the messages
– potential length fields: at most four consecutive binary tokens, or a text token in decimal or hex format
– offset: compare the value difference with the difference of the offsets of some subsequent tokens
– cookie: inferred at the end of the merging phase, following RolePlayer [3]
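The length-field intuition can be illustrated with a brute-force check over binary candidates. This is a sketch under assumptions, not the authors' procedure: it works on raw bytes at fixed offsets, reads candidates as big-endian integers, ignores text-encoded lengths, and requires the value difference to equal the size difference for every pair of messages.

```python
from itertools import combinations
from typing import List, Tuple


def find_length_fields(messages: List[bytes], max_width: int = 4) -> List[Tuple[int, int]]:
    """Return (offset, width) candidates whose value differences track size differences."""
    shortest = min(len(m) for m in messages)
    sizes = [len(m) for m in messages]
    found = []
    for offset in range(shortest):
        for width in range(1, max_width + 1):
            if offset + width > shortest:
                break
            # Big-endian interpretation is an assumption of this sketch.
            values = [int.from_bytes(m[offset:offset + width], "big") for m in messages]
            if len(set(values)) < 2:
                continue  # a constant candidate carries no evidence
            consistent = all(
                values[a] - values[b] == sizes[a] - sizes[b]
                for a, b in combinations(range(len(messages)), 2)
            )
            if consistent:
                found.append((offset, width))
    return found


# Made-up messages: one constant byte, one length byte, then the payload.
messages = [bytes([0x01, len(p)]) + p for p in (b"ab", b"abcde", b"abcdefgh")]
length_fields = find_length_fields(messages)
```

On this sample the one-byte field at offset 1 is reported, but so is the two-byte window starting at offset 0, because a constant high byte does not change the differences. Deciding the true field boundary needs more evidence than this check provides.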
Section 3
2-2 Format Comparison
Decide whether two inferred message formats are the same
– token by token
– from left to right
Ideally, two tokens can be considered to match if their semantics match.
Section 3
2-3 Recursive Clustering by Format Distinguishers
Three criteria determine whether a token is an FD:
1. The number of unique values taken by this token across the set of messages is less than a threshold.
2. (If the 1st criterion is satisfied) Divide the cluster into sub-clusters, one per unique token value. The size of the largest sub-cluster must exceed a threshold, which guarantees a meaningful format inference in at least one sub-cluster.
3. (If the potential FD passes the 2nd criterion) Invoke format comparison across sub-clusters.
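The first two criteria can be sketched as a single split function. The threshold values below are assumptions chosen for illustration, the function name is hypothetical, and criterion 3 (format comparison across sub-clusters) is deliberately left out. Rejecting tokens with a single value is also an assumption of this sketch, since such a token cannot split the cluster.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, bytes]  # (class, value)

MAX_UNIQUE_VALUES = 10       # assumed threshold for criterion 1
MIN_LARGEST_SUBCLUSTER = 20  # assumed threshold for criterion 2


def split_by_candidate_fd(
    cluster: List[List[Token]],
    index: int,
    max_unique: int = MAX_UNIQUE_VALUES,
    min_largest: int = MIN_LARGEST_SUBCLUSTER,
) -> Optional[Dict[bytes, List[List[Token]]]]:
    """Return sub-clusters keyed by token value, or None if the token is rejected."""
    sub_clusters: Dict[bytes, List[List[Token]]] = defaultdict(list)
    for message in cluster:
        sub_clusters[message[index][1]].append(message)
    if len(sub_clusters) < 2:
        return None  # assumption: a constant token cannot distinguish formats
    if len(sub_clusters) >= max_unique:
        return None  # criterion 1 fails: too many unique values
    if max(len(msgs) for msgs in sub_clusters.values()) <= min_largest:
        return None  # criterion 2 fails: no sub-cluster is large enough
    return dict(sub_clusters)


# 30 made-up messages: token 0 takes two values, token 1 is unique per message.
cluster = [
    [("binary", b"\x01" if i < 25 else b"\x02"), ("binary", bytes([i]))]
    for i in range(30)
]
by_first = split_by_candidate_fd(cluster, 0)
by_second = split_by_candidate_fd(cluster, 1)
```

Here token 0 passes both checks and yields two sub-clusters, while token 1 is rejected because it takes 30 distinct values.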
Section 3
2-3 Recursive Clustering by Format Distinguishers
This process is recursively performed on each of the sub-clusters because a message may have more than one FD token.
They find the next FD token by scanning further down the message towards the right (end) of the message.
The format inference is invoked again on the set of messages in each sub-cluster.
– The inferred token properties and semantics might change because the set of messages has become smaller.
Section 3
3 Merging with Type-Based Sequence Alignment
In the previous phases, we are conservative to ensure that the format inference procedure operates correctly on a set of messages of the same format.
– This leads to a new problem of over-classification.
– E.g., an SMB trace with 4M messages can yield 7,000 clusters (inferred formats), while the total number of true formats is 130.
Section 3
3 Merging with Type-Based Sequence Alignment
Type-based sequence alignment
– It only allows two tokens of the same class (binary or text) to align with each other.
– They claim two aligned tokens are matched if they either have the same semantics or share at least one value.
– Extra gap constraints
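Type-based alignment can be sketched as a Needleman-Wunsch style global alignment in which tokens of different classes are never aligned with each other. The scoring constants are assumptions, the extra gap constraints mentioned on the slide are not modeled, and `tokens_match` simply restates the matching rule above.

```python
from typing import List, Optional, Set, Tuple

GAP = -1    # assumed gap penalty
MATCH = 1   # assumed reward for aligning two tokens of the same class
NEG_INF = float("-inf")

Pair = Tuple[Optional[int], Optional[int]]  # (index in a, index in b); None marks a gap


def align(a: List[str], b: List[str]) -> List[Pair]:
    """Globally align two sequences of token classes ('text' / 'binary')."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Tokens of different classes may not align: the diagonal move is forbidden.
            diag = score[i - 1][j - 1] + MATCH if a[i - 1] == b[j - 1] else NEG_INF
            score[i][j] = max(diag, score[i - 1][j] + GAP, score[i][j - 1] + GAP)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and score[i][j] == score[i - 1][j - 1] + MATCH:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


def tokens_match(sem_a: Optional[str], values_a: Set[bytes],
                 sem_b: Optional[str], values_b: Set[bytes]) -> bool:
    """Aligned tokens match if they share semantics or at least one observed value."""
    same_semantics = sem_a is not None and sem_a == sem_b
    return same_semantics or bool(values_a & values_b)


alignment = align(["binary", "text", "binary"], ["binary", "binary"])
```

For the sample input, the text token of the first format is aligned against a gap, and the two binary tokens on each side are paired with each other.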
Section 3
An Example: true message from Ethereal
Section 3
An Example: the final inferred format by Discoverer (1/2)
Section 3
An Example: the final inferred format by Discoverer (2/2)
Section 3
The inferred format is a sequence of tokens with token properties (binary vs. text, constant vs. variable) and semantics (e.g., length fields).
Evaluation
5,700 lines of C++ code on Windows
The un-optimized implementation takes about 6-12 hours for a trace of several million messages
Data Sets
– a honeyfarm site (which responds to unsolicited, mostly malicious traffic); SMB only.
– a busy enterprise (which has diverse and high-volume traffic); HTTP, SMB, RPC.
Section 4
Evaluation Methodology
Correctness
– If a cluster contains messages from more than one true format, then Discoverer will make an incorrect inference.
– For all three protocols, over 90% of clusters contain messages from a single true format.
Conciseness
– A large number of redundant formats will affect the conciseness of the protocol specifications generated.
– Measured as the ratio of the number of inferred formats to the number of true formats followed by their messages (5:1).
– Almost 80% of true formats are scattered into at most five clusters.
Section 4
Evaluation Methodology (cont’d)
Coverage
– message coverage: the fraction of messages covered by our inferred formats
– format coverage: the fraction of true formats followed by covered messages
– For all three protocols, the message coverage is above 95% while the format coverage is around 30-40%.
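The three metrics can be computed from a labeled trace roughly as follows. This is a sketch under assumptions: each message carries the id of the inferred cluster that covers it (or `None` if uncovered) together with its true format, and the function name, labels, and sample data are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Set, Tuple

Assignment = Tuple[Optional[Hashable], Hashable]  # (inferred cluster id or None, true format)


def evaluate(assignments: List[Assignment], all_true_formats: Set[Hashable]) -> Dict[str, float]:
    formats_in_cluster: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    covered_messages = 0
    for cluster_id, true_format in assignments:
        if cluster_id is None:
            continue  # message not covered by any inferred format
        formats_in_cluster[cluster_id].add(true_format)
        covered_messages += 1
    covered_formats = set().union(*formats_in_cluster.values())
    single_format = sum(1 for fmts in formats_in_cluster.values() if len(fmts) == 1)
    return {
        # fraction of clusters whose messages all follow one true format
        "correctness": single_format / len(formats_in_cluster),
        # inferred formats per true format followed by their messages
        "conciseness": len(formats_in_cluster) / len(covered_formats),
        "message_coverage": covered_messages / len(assignments),
        "format_coverage": len(covered_formats) / len(all_true_formats),
    }


assignments = [
    ("A", "get"), ("A", "get"), ("B", "get"),
    ("C", "post"), ("C", "put"), (None, "delete"),
]
metrics = evaluate(assignments, {"get", "post", "put", "delete"})
```

In the sample, cluster C mixes two true formats and so counts against correctness, while the true format `get` is split across clusters A and B, which is the kind of redundancy the conciseness ratio is meant to expose.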
Section 4
Tunable Parameters
Section 4
HTTP
Section 4
The HTTP protocol allows an arbitrary number of “parameter: value” pairs in an arbitrary order.
1. Most messages (more than 99%) fall in the top 1,000 true formats; RPC and CIFS/SMB show a similar trend.
2. They inferred 3,926 formats, which covered 5,938,511 out of 5,950,453 messages (99.8%).
3. The covered messages belong to 865 out of 2,696 true formats (32%).
Limitations and Future Work
Trace Dependency
– message formats that never occur in the trace cannot be inferred
– certain variable fields never take more than one value in the trace
Pre-Defined Semantics
– Only a set of pre-defined semantics can be inferred.
Coalescing Fields
– Unlike text fields, no clue may be available in delimiting binary fields
– only a few approaches are available (e.g., does this byte vary as much as the other one?)
Section 6
Limitations and Future Work (cont’d)
Asynchronous Protocols
– messages in one direction may be interrupted by those in the other direction
– messages in one direction may be delayed, allowing two back-to-back messages in the other direction
Application Sessions
– Currently, Discoverer analyzes each connection in isolation.
State Machine Inference
– captures the sequences of messages in all sessions in the trace
Section 6
Conclusion and Comment
Discoverer is a tool that aims to automate the protocol reverse engineering process.
Protocol knowledge is very difficult to model automatically.
– So far they only model field semantics (offset, length, …).
– How about the communication interaction (user intention, …)?