FiG: Automatic Fingerprint Generation
Shobha Venkataraman
Joint work with Juan Caballero, Pongsin Poosankam, Min Gyung Kang,
Dawn Song & Avrim Blum
Carnegie Mellon University
2
Fingerprinting
[Figure: a network administrator identifying hosts running Linux, Solaris, Windows XP SP1, and Windows XP SP2]
Used to identify:
- versions of software on hosts
- operating systems of hosts
- hosts running versions with vulnerabilities
3
The Fingerprinting Process
Fingerprint: set of queries sent to host + classification function analyzing queries & responses
Well-known fingerprinting tools: nmap, fpdns
[Figure: Fingerprinting Tool sends Queries to Host, receives Responses, and outputs: what OS? (e.g. Linux)]
4
Finding Fingerprints
How do fingerprinting tools get fingerprints?
Existing approach: manual identification
- Incomplete, time-consuming
- Difficult to keep up-to-date
[Figure: Fingerprinting Tool, annotated "What queries?" and "What classification function?"]
Need automatic, accurate fingerprint generation!
5
Our Contribution: FiG
Demonstrate automatic fingerprint generation is possible
In particular:
- Use machine learning to automatically generate fingerprints
- Automatically generate accurate fingerprints:
  - Distinguishing OS
  - Distinguishing implementations of DNS servers
- Finding new fingerprints
6
Outline
- Fingerprint Generation Problem
- Overview of Approach
- Automatic Fingerprint Generation
- Experimental Results
- Conclusion
7
Fingerprint Generation Problem
Goal: find fingerprints, i.e.
- Useful queries
- Classification function that distinguishes implementations
[Figure: Linux, Windows XP, and Solaris hosts -> Fingerprint Generator -> Fingerprints -> Fingerprinting Tool]
9
FiG: Overview of Approach
[Figure: FiG: Automatic Fingerprint Generation. Query Exploration -> Candidate Queries -> Learning -> Fingerprints -> Fingerprinting Tool]
Query exploration: generate candidate queries
Learning: automatically find fingerprints
11
Query Exploration
Goal: generate candidate queries
- query: specially crafted packet sent to host
Infeasible to generate all possible queries
- All queries = all possible byte combinations of packet header
- e.g., 40 bytes of TCP & IP header => 2^320 queries!
Instead, use protocol semantics to design queries
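The 2^320 figure follows directly from the header length: 40 bytes at 8 bits each is 320 bits. A quick check in Python, assuming only the 40-byte figure quoted on the slide:

```python
# Each header byte can take 256 values, so a 40-byte TCP & IP header
# admits 256**40 distinct byte combinations.
HEADER_BYTES = 40
BITS_PER_BYTE = 8

header_bits = HEADER_BYTES * BITS_PER_BYTE   # 320 bits
total_queries = 2 ** header_bits             # same number as 256 ** 40

print(header_bits)
print(f"{total_queries:.3e}")                # scientific notation
```

The result is a 97-digit number, which is why exhaustive enumeration is ruled out.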
12
Query Exploration
Queries: packets with unusual values in fields of header
Explore unusual values for fields independently
- Explore fields with rich semantics exhaustively
  - i.e., all possible values
  - e.g., TCP flags
- Explore other fields selectively
  - i.e., some valid, invalid values
  - e.g., TCP checksum, TCP src port
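The exploration strategy above can be sketched as a small generator. This is only an illustration: the field names, baseline values, and the particular "selective" values are assumptions, and the headers are modelled as dictionaries rather than real packets.

```python
from typing import Dict, Iterator, Sequence

# Hypothetical baseline header: field name -> value. Field names and
# defaults are illustrative assumptions, not taken from the talk.
BASE_HEADER: Dict[str, int] = {
    "tcp_flags": 0x02,       # SYN
    "tcp_src_port": 40000,
    "tcp_checksum_ok": 1,    # 1 = valid checksum, 0 = deliberately wrong
}

# Field with rich semantics: try every possible value of the 8 flag bits.
EXHAUSTIVE: Dict[str, Sequence[int]] = {"tcp_flags": range(256)}

# Other fields: try only a few valid and invalid values (assumed choices).
SELECTIVE: Dict[str, Sequence[int]] = {
    "tcp_src_port": [0, 80, 65535],
    "tcp_checksum_ok": [0, 1],
}

def candidate_queries() -> Iterator[Dict[str, int]]:
    """Yield headers that vary one field at a time from the baseline."""
    plan = {**EXHAUSTIVE, **SELECTIVE}
    for field, values in plan.items():
        for value in values:
            query = dict(BASE_HEADER)
            query[field] = value
            yield query

queries = list(candidate_queries())
print(len(queries))
```

Varying fields independently keeps the number of candidates linear in the number of values tried, instead of multiplying across fields.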
13
FiG: Overview of Approach
[Figure: Query Exploration -> Candidate Queries -> Learning -> Fingerprints -> Fingerprinting Tool, with Learning expanded into three stages]
- Data Collection
- Training Phase: learn potential fingerprints
- Testing Phase: test accuracy of fingerprints
14
Data Collection
[Figure: Data Collection -> Training Phase -> Testing Phase, with Data Collection highlighted]
1. Send candidate queries to hosts
2. Collect responses from hosts
3. Split into training & testing data
[Figure: Candidate Queries and Responses -> Data Collection -> Training Data, Testing Data]
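Step 3 can be sketched as follows. The slides do not say how the split is made, so the per-host record layout, the random shuffle, and the 50/50 ratio here are all assumptions, and the responses are made-up toy values.

```python
import random
from typing import Dict, List, Tuple

# One record per host: (implementation label, {query id: response bytes}).
Record = Tuple[str, Dict[int, bytes]]

def split_records(records: List[Record], train_fraction: float = 0.5,
                  seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Shuffle host records and cut them into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy records standing in for collected data (values are made up).
records: List[Record] = [
    ("Linux", {0: b"\x00\x00"}),
    ("Linux", {0: b"\x00\x00"}),
    ("Windows", {0: b"\xff\xff"}),
    ("Windows", {0: b"\x40\xe8"}),
]
training, testing = split_records(records)
print(len(training), len(testing))
```

Splitting by host rather than by individual response keeps all of one host's responses on the same side of the split.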
15
Training Phase
[Figure: Data Collection -> Training Phase -> Testing Phase, with Training Phase highlighted]
Goal: learn potential fingerprints from data
Intuition: different implementations differ in bytes of responses
Learn which bytes of responses distinguish between implementations!
16
What we’re learning
[Figure: Data Collection -> Training Phase -> Testing Phase; the training data consists of <queries, responses> for Windows, Solaris, and Linux]
1. Extract features
2. Combine features to distinguish implementations
Outline:
- Features
- Classification functions
- Combining into fingerprints
17
Features
- Analyze only bytes of response
- Use both value & position of individual bytes in response
- Capture this idea with position-substring
[Figure: response byte sequence "a b c d e f g h i j k" at positions 0-10. Some example position-substrings: "a b c d" at positions 0-3, "e f g" at positions 4-6, "h i j" at positions 7-9, "k" at position 10]
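A position-substring can be modelled as a (start position, bytes) pair. The sketch below enumerates them for the example sequence on the slide; the function name and the optional length cap are assumptions made for illustration.

```python
from typing import Optional, Set, Tuple

PositionSubstring = Tuple[int, bytes]  # (start position, bytes found there)

def position_substrings(response: bytes,
                        max_len: Optional[int] = None) -> Set[PositionSubstring]:
    """Enumerate every (start, substring) pair in a response."""
    n = len(response)
    limit = n if max_len is None else max_len
    return {
        (start, response[start:start + length])
        for start in range(n)
        for length in range(1, min(limit, n - start) + 1)
    }

features = position_substrings(b"abcdefghijk")
print((4, b"efg") in features)   # "e f g" at positions 4-6
print((5, b"efg") in features)   # same bytes, wrong position
```

Because the position is part of the feature, the same bytes at a different offset count as a different feature.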
18
Classification Functions
Analyze each query & each implementation separately
[Figure, e.g. for query q, for Linux implementation: position-substrings of response to query q -> Classification function -> YES (comes from Linux) / NO (does not come from Linux)]
Two classes of functions:
1. Conjunctions
2. Decision lists
19
Conjunctions
- Capture identical behaviour across all hosts
  - require position-substrings distinctive to Linux to appear in responses from ALL Linux hosts
if (response[4-5]==0x0000 && response[34-35]==0x16d0) then Linux else NotLinux
[Figure: Linux response has 00 00 at positions 4-5 and 16 d0 at positions 34-35; NotLinux response has 00 04 at positions 4-5 and 16 d0 at positions 34-35]
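One way to learn such a conjunction is the textbook elimination approach: keep only the position-substrings present in every positive response, then check that no negative response satisfies all of them. The slides do not state which learning algorithm FiG uses, so this is a sketch under that assumption, with shortened toy responses in place of real packets.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

Feature = Tuple[int, bytes]  # (start position, substring)

def features(response: bytes, max_len: int = 2) -> Set[Feature]:
    """Position-substrings of length 1..max_len (the cap is an assumption)."""
    return {
        (i, response[i:i + n])
        for i in range(len(response))
        for n in range(1, max_len + 1)
        if i + n <= len(response)
    }

def learn_conjunction(pos: List[bytes],
                      neg: List[bytes]) -> Optional[FrozenSet[Feature]]:
    """Keep features shared by ALL positives; fail if a negative has them all."""
    if not pos:
        return None
    common = set.intersection(*(features(r) for r in pos))
    for r in neg:
        if common <= features(r):
            return None  # no conjunction separates the two groups
    return frozenset(common)

def matches(conj: FrozenSet[Feature], response: bytes) -> bool:
    """True if every position-substring in the conjunction appears."""
    return conj <= features(response)

# Toy 5-byte responses (made up): the last byte varies between hosts.
linux = [b"\x00\x00\x16\xd0\x01", b"\x00\x00\x16\xd0\x02"]
other = [b"\x00\x04\x16\xd0\x01"]
conj = learn_conjunction(linux, other)
print(matches(conj, linux[0]), matches(conj, other[0]))
```

The learned conjunction ignores the varying last byte and rejects the non-Linux response because its second byte differs.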
20
Decision Lists
- Need more expressivity than conjunctions
- Capture multiple types of behaviour within implementation
  - allow many sets of position-substrings, each distinctive to implementation (e.g. Windows)
if (response[34-35] == 0xffff) then Windows
else if (response[34-35] == 0x40e8) then Windows
else NotWindows
[Figure: two Windows responses, one with ff ff and one with 40 e8 at positions 34-35]
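The rule on the slide translates directly into an ordered list of tests that is evaluated top to bottom. The sketch below covers evaluation only, not learning; the `Rule` layout and the `pad` helper are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# A rule is ({start position: expected bytes}, label); all tests must match.
Rule = Tuple[Dict[int, bytes], str]

def classify(rules: List[Rule], default: str, response: bytes) -> str:
    """Return the label of the first rule whose every test matches."""
    for tests, label in rules:
        if all(response[pos:pos + len(val)] == val for pos, val in tests.items()):
            return label
    return default

# The decision list from the slide: two distinct Windows behaviours.
WINDOWS_RULES: List[Rule] = [
    ({34: b"\xff\xff"}, "Windows"),
    ({34: b"\x40\xe8"}, "Windows"),
]

def pad(tail: bytes) -> bytes:
    """Toy response: 34 zero bytes followed by the bytes at positions 34-35."""
    return bytes(34) + tail

print(classify(WINDOWS_RULES, "NotWindows", pad(b"\x40\xe8")))
print(classify(WINDOWS_RULES, "NotWindows", pad(b"\x16\xd0")))
```

A single conjunction could not accept both `ff ff` and `40 e8` at the same position, which is the extra expressivity the slide refers to.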
22
Binary-fingerprints
Binary-fingerprint for implementation (e.g., Linux) is:
- single query
- + classification function: e.g., conjunction or decision list
- = boolean: e.g. Linux, or Not Linux?
Binary-fingerprint separates ONE implementation
Learning (so far) finds binary-fingerprints
- Conjunctions/decision lists of position-substrings (e.g. Linux or Not Linux? Windows or Not Windows?)
23
Multi-class Fingerprint
Combine binary-fingerprints for multiple implementations
Multi-class fingerprint is:
- single query
- + classification functions, e.g. conjunctions, decision lists
- = implementation, e.g. Linux, Windows, Solaris, unknown?
[Figure: binary-fingerprints for query q (Linux or Not Linux? Windows or Not Windows? Solaris or Not Solaris?) combine into a multi-class fingerprint for query q (Linux? Windows? Solaris? unknown?)]
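A minimal sketch of the combination step, assuming one simple rule that the slides do not spell out: report an implementation only when exactly one binary-fingerprint answers yes, and "unknown" otherwise. The toy binary-fingerprints and byte values are made up.

```python
from typing import Callable, Dict

BinaryFingerprint = Callable[[bytes], bool]  # response to query q -> yes/no

def multi_class(binary: Dict[str, BinaryFingerprint], response: bytes) -> str:
    """Name the single implementation that claims the response, else 'unknown'."""
    yes = [name for name, clf in binary.items() if clf(response)]
    return yes[0] if len(yes) == 1 else "unknown"

# Hypothetical binary-fingerprints for one query, on 4-byte toy responses.
binary: Dict[str, BinaryFingerprint] = {
    "Linux": lambda r: r[0:2] == b"\x00\x00",
    "Windows": lambda r: r[2:4] in (b"\xff\xff", b"\x40\xe8"),
}

print(multi_class(binary, b"\x00\x00\x16\xd0"))
print(multi_class(binary, b"\x00\x04\x16\xd0"))
```

Treating both "no match" and "several matches" as unknown keeps the multi-class output unambiguous.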
24
Training Phase Summary
- Analyze responses to all queries, one at a time
- Use position-substrings of bytes in response
- Generate binary-fingerprints & multi-class fingerprints
- Send these to testing phase
25
Testing Phase
[Figure: Data Collection -> Training Phase -> Testing Phase, with Testing Phase highlighted]
Which fingerprints are accurate?
[Figure: Testing Data + Binary & Multi-class Fingerprints -> Testing Phase -> Fingerprints -> Fingerprinting Tool]
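The filtering question can be sketched as scoring each candidate fingerprint on the held-out testing data and keeping those that clear a threshold. The threshold value and the data layout are assumptions; the slides do not say what accuracy is required.

```python
from typing import Callable, Dict, List, Tuple

Fingerprint = Callable[[bytes], bool]
Sample = Tuple[bytes, bool]  # (response, True if host is the target implementation)

def accuracy(fingerprint: Fingerprint, samples: List[Sample]) -> float:
    """Fraction of held-out samples the fingerprint labels correctly."""
    if not samples:
        return 0.0
    correct = sum(fingerprint(resp) == label for resp, label in samples)
    return correct / len(samples)

def keep_accurate(candidates: Dict[str, Fingerprint], samples: List[Sample],
                  threshold: float = 1.0) -> Dict[str, Fingerprint]:
    """Keep candidates whose testing accuracy meets the (assumed) threshold."""
    return {name: fp for name, fp in candidates.items()
            if accuracy(fp, samples) >= threshold}

# Toy candidates and testing data (made up).
candidates: Dict[str, Fingerprint] = {
    "good": lambda r: r[:2] == b"\x00\x00",
    "bad": lambda r: True,
}
testing_data: List[Sample] = [(b"\x00\x00", True), (b"\xff\xff", False)]
kept = keep_accurate(candidates, testing_data)
print(sorted(kept))
```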
26
Outline
- Fingerprint Generation Problem
- Overview of Approach
- Automatic Fingerprint Generation
  - Query Exploration Phase
  - Learning Phase
- Experimental Results
  - Experimental Setup & Data
  - Fingerprinting Results: Binary & Multi-class Fingerprints
  - Examples of New Fingerprints
- Conclusion
27
Experiment Setup & Data
OS fingerprint generation:
- 3 OS: 77 Windows, 29 Linux, 22 Solaris hosts
- 305 different queries
DNS fingerprint generation:
- 5 DNS server implementations: 10 BIND8, 12 BIND9, 11 Windows Server 2003, 10 MyDNS, 11 TinyDNS hosts
- 96 different queries
28
Multi-class Fingerprints
Multi-class fingerprint: one-query fingerprint distinguishing ALL implementations simultaneously
- OS: 66 queries with multi-class fingerprints
- DNS: 19 queries with multi-class fingerprints
All these are decision lists!
- No multi-class fingerprints with conjunctions found
- Decision list has greater discriminatory power
29
All Fingerprints: OS
                 Binary-fingerprints             Multi-class
OS               Linux    Solaris    Windows
Decision list    130      98         98          66
Conjunction      42       53         53          0
Binary-fingerprint: one-query fingerprint distinguishing ONE implementation from rest
- Lots more binary-fingerprints!
- Find conjunctions & decision lists in binary-fingerprints
- Again, more fingerprints with more expressive decision lists
- Similar results for DNS
30
Examples of New Fingerprints
Invalid value in data offset field:
- Windows & Solaris hosts respond when value < 5
- Linux hosts do not respond
RST+ACK packets in responses:
- Linux & Solaris hosts set TCP Ack # to 0
- Windows hosts set TCP Ack # to Ack # of query
31
Examples of New Fingerprints
Behaviour on ECN & CWR bits:
- Linux & Windows hosts ignore ECN & CWR bits in queries
- Solaris hosts do not ignore them (sometimes)
Behaviour of QdCount field on invalid queries (DNS fingerprinting):
- Some servers copy the field value, others don’t
32
Conclusion
Automatic fingerprint generation is possible
- Use machine learning to identify fingerprints
- Generate fingerprints automatically for 2 applications:
  - Distinguish OS
  - Distinguish implementations of DNS servers
- Find multi-class fingerprints using decision lists
- Discover new fingerprints for fingerprinting tools
35
Binary-fingerprints: DNS
DNS              BIND8    BIND9    Microsoft    MyDNS    TinyDNS
Conjunction      0        0        22           2        9
Decision-list    33       28       32           29       41
Binary-fingerprint: one-query fingerprint distinguishing ONE implementation from rest
- Similar results for DNS binary-fingerprints
- More fingerprints with more expressive decision list
- No binary-fingerprints with conjunctions for BIND8 & BIND9
36
Related Work
Active fingerprinting:
- Comer & Lin ’94: Probing to find differences in TCP
- Padhye & Floyd ’01: compliance testing & protocol violations
Passive fingerprinting:
- Paxson ’97: TCP implementation with traffic traces
- Beverly ’04, Lippman et al ’03: classify OS
- Franklin et al ’06: wireless device driver fingerprinting
Tools:
- OS fingerprinting: Nmap, queso, Xprobe, Snacktime
- Passive fingerprinting: p0f, siphon
Defeating OS fingerprinting:
- Smart et al ’00: TCP Fingerprint scrubber
- Tools: Morph, IPPersonality