Semantically-Enabled Digital Investigations - Research Overview


A METHOD FOR SEMANTIC INTEGRATION AND CORRELATION OF DIGITAL EVIDENCE USING A HYPOTHESIS-BASED APPROACH

Semantically-Enabled Digital Investigations

by Spyridon Dosis

February 2013, Stockholm

Problem Definition

Sophisticated attacks against highly interconnected networked systems.

Multitude, variety and size of data sources with possible evidentiary value.

Need for continuous state-of-the-art technical expertise.

Evidence-oriented first-generation forensic tools with poor integration and correlation features.

Lack of common, standardized data representation/abstraction formats.

Research Questions and Limitations

How can the Semantic Web technologies and the Linked Data initiative be applied to Digital Forensics?

How can a common ontology-based knowledge representation layer improve the level of integration of currently disjoint specialized areas of DF, such as storage, network, mobile and live memory forensics?

How may such a new method improve the efficiency and capabilities of existing DF investigation models, techniques and tools?

Not full coverage of the features and capabilities of the Semantic Web technologies.

Simplified complexity for the conducted experiments.

Digital Evidence

“any digital data that contain reliable information that supports or refutes a hypothesis about an incident” – (Carrier & Spafford 2004)

Continuously increasing scope. Varying layers of abstraction.

(Schatz 2007) identifies three basic properties:

Latency -> Semantic Interpretation; Fidelity -> Chain of Custody; Volatility -> Order of Volatility

Digital Investigations

The set of principles and methods that are followed during the lifecycle of digital evidence with the goal of event reconstruction.

Slight definition variations among different contexts.

The Event-based Digital Forensic Investigation Framework (Carrier & Spafford 2004): System Preservation, Evidence Searching, Event Reconstruction.

The Digital Investigation Process (Casey 2004).

The Hypothesis-based Approach (Carrier 2006).

Semantic Web Technologies

“… information is given well-defined meaning, better enabling computers and people to work in cooperation” – (Tim Berners-Lee 2001)

Metadata – Annotation of data providing contextual or domain-specific information about the content

Ontology – “explicit and formal specification of a conceptualization” – (Gruber 1993) Entities, attributes, interrelationships

Open world assumption.

Reasoning over data by inferring implicit conclusions.

Semantic Web Architecture: Part A

adapted from Antoniou & Van Harmelen 2004

• URI/IRI enables unique identification of a resource under a global scope.

• XML provides a consistent, machine-consumable data encoding scheme with unambiguous scoping.

• XML Schema is used for defining the rules and the ‘tag’ vocabulary that data must conform to.

• RDF provides a simple but flexible data model for encoding metadata as Subject-Predicate-Object triples.

• RDF Schema is used for defining RDF vocabularies: Class and Property hierarchies.
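As a minimal sketch of the Subject-Predicate-Object model (the URIs and the hasMD5 property are hypothetical, not taken from the thesis ontologies), a single triple can be asserted and serialized with the Jena API that the reference implementation also relies on:

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Property;
    import com.hp.hpl.jena.rdf.model.Resource;

    public class TripleExample {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            // Hypothetical URIs for illustration only
            Resource file = model.createResource("urn://disk1/file_42");              // subject
            Property hasMD5 = model.createProperty("urn://example/ontology#hasMD5");  // predicate
            file.addProperty(hasMD5, "d41d8cd98f00b204e9800998ecf8427e");             // object (literal)
            model.write(System.out, "RDF/XML");                                       // serialize the graph
        }
    }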

Semantic Web Architecture: Part B

adapted from Antoniou & Van Harmelen 2004

• OWL 2 is a computational logic-based language that enables automated reasoning for inferencing and consistency verification.

• Increased expressivity: Property Restrictions, Class and Property Equivalence, Property Relationships, Global Cardinality Constraints, and Individual Identity (no unique-names assumption).

• OWL Dialects for varying levels of expressiveness and computational complexity.

• SWRL supports more advanced reasoning cases.

• SPARQL is a query language and protocol for RDF data.
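As a minimal, self-contained sketch (again with hypothetical URIs) of evaluating a SPARQL SELECT query over an in-memory RDF model with Jena:

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.QueryFactory;
    import com.hp.hpl.jena.query.QuerySolution;
    import com.hp.hpl.jena.query.ResultSet;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class SparqlExample {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            // Hypothetical triple: a file resource carrying an MD5 hash value
            model.createResource("urn://disk1/file_42")
                 .addProperty(model.createProperty("urn://example/ontology#hasMD5"),
                              "d41d8cd98f00b204e9800998ecf8427e");
            String queryString =
                "SELECT ?file ?md5 WHERE { ?file <urn://example/ontology#hasMD5> ?md5 }";
            QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(queryString), model);
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.nextSolution();
                System.out.println(row.getResource("file") + " -> " + row.getLiteral("md5"));
            }
            qe.close();
        }
    }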

Previous Work #1

XML-based Approaches

Digital Forensics XML (Garfinkel 2009) for describing disk images and their contents (partitions, files, byte runs).

EDRM XML for describing electronic document metadata.

XIRAF for XML-based extraction, storage and querying of evidence files.

DEX for including provenance-related metadata.

Other domain-specific XML approaches for live forensics, network forensics, vulnerability assessment, logs, malware.

These support a level of tool interoperability and standardization, but offer no support for automated reasoning or semantic integration of data.

Previous Work #2

RDF-based Approaches

The AFF forensic format uses RDF for including arbitrary metadata (system- or process-related, user-specific).

Strengthening the chain of custody with additional RDF metadata (evidence-access, examiner- or artifact-related information) (Giova 2011).

Ontological Approaches

FORE (Schatz 2004) comprised a log parser, a forensic ontology and a custom rule language for aggregating lower-level events into higher-level ones. Later expanded by referencing external ontologies.

DIALOG conceptualized ‘procedural’ and ‘practical’ aspects of a digital investigation with practical examples of registry analysis. Later expanded with additional concepts for encoding forensically relevant types of data.

(Saad 2010) applied an ontology in the network forensics area for modeling network attacks and supporting different types of reasoning based on collected events.

Methodology

Two main research paradigms in IT (Hevner 2004): Behavioural Science and Design Science.

Outcomes of a design science process can be: constructs, models, methods, instantiations.

Design Science Method

adapted from Johannesson & Perjons 2012

• Problem Specification: literature review, case studies, empirical observations

• Artifact Outline and Requirements: literature review, case studies

• Design and Development

• Artifact Demonstration: laboratory experiment (simulated cases)

• Artifact Evaluation: “ex ante evaluation”

• Communication of the artifact

A Semantic Web approach for Digital Investigations

Information Integration: common identifiers; different identifiers.

A Semantic Web approach for Digital Investigations

Semi-structured Data Support

Classification and Inference

Extensibility

Provenance (Named Graphs)

Search

Relation to Digital Investigation Reference Models

• Conceptual Mapping between the Semantic Web architecture and digital investigation frameworks

• Previous phases are assumed as prerequisites

Evaluation Criteria

Goal – Question – Metric (GQM) approach

Generic Criteria

Goal: The proposed method should be appropriate for the task at hand.
Questions: What is the relationship of the proposed method with existing digital investigation practices and tools? What are the case context requirements for the method to be applied?
Metric: The ability of the method to handle different types of cases (network-related events, media device examinations etc.), measured by the number of different data types it can process.

Goal: The method should provide good support for decision-making by providing relevant and usable results.
Questions: What types of new knowledge can the method extract, and what is their usefulness? How can the examiner formulate and evaluate hypotheses about the evidence files and receive informative results?
Metric: The ability of the method to support arbitrary queries and provide answers over the whole body of collected evidence. This can be quantified by the precision and recall information retrieval measures over the query results.

Goal: The method should be cost-effective in terms of storage and time needs.
Questions: How does the method accept and store input data, intermediate and final results? What are the storage requirements for such an implementation? How much time is needed to apply the method to the input data, and how can it reduce the time the investigation process takes?
Metric: Storage size requirements for representing input and output data. Time needed for performing the analysis of data or evaluating user-submitted queries.

Goal: The method should be flexible and scalable.
Questions: Can the method deal with new sources of data and seamlessly integrate new forms of ontologically-expressed knowledge and rules? Can the method support large amounts of data, and what problems may such complexity incur?
Metric: The ability of the method to process new data and accept additional ontologies or rules without major (possibly even no) modifications to the existing steps, measured by the amount of configuration or code modification such changes require. The method's ability to handle large amounts of data, measured by the input size in relation to processing time or produced errors (e.g. number of captured network packets, firewall logs, disk image sizes etc.).

Evaluation Criteria

Forensic Criteria

Goal: The method's results should be reproducible.
Questions: Do the results of the method behave in a deterministic manner when applied to the same input data, or are they inconsistent across multiple tests?
Metric: The method's results (e.g. inferred axioms, query results) should be the same given the same dataset, independently of other factors such as the order in which the evidence files are processed. This can be measured by the number of errors or differing results after multiple applications of the method on the same dataset.

Goal: The method's possible errors should be minimal and determined.
Questions: Does the method produce accurate results? Can the method accept inconsistent or malformed input data? How does the method deal with incomplete data? Can the method produce results that are ambiguous or inconsistent with the specified ontologies?
Metric: The method's results can be automatically checked by a reasoning engine for possible inconsistencies between asserted and inferred axioms and the given ontologies. The method's error rate can be measured by the error messages produced during its lifecycle.

Goal: The method must provide logging capabilities for the inclusion of arbitrary metadata regarding the case, the entities and the evidence objects involved.
Questions: Does the method support the addition of annotation axioms with respect to the asserted or inferred axioms? Does the method allow logging of its various steps as they are applied and of the results they produce?
Metric: The ability to insert logging information during the method can be measured by its flexibility to accept arbitrary metadata.

Goal: The method should protect the integrity of the collected data.
Questions: Can the method operate on forensic copies of the collected evidence? Does the method use hashing algorithms to ensure the consistency and integrity of these forensic copies?
Metric: The method should protect the integrity of the collected data, files and devices throughout its whole lifecycle by working on forensic copies instead of the originals and verifying any hash values these copies carry as forensic metadata. The ability to perform these checks for different data sources can be considered a metric.

Evaluation Criteria

Semantic Web Related Criteria

Goal: System Heterogeneity – Platform Independence.
Questions: Can parts of the method be applied on different systems and the partial results later recombined? Are there any restrictions with respect to the configuration of these analysis systems?
Metric: The ability of the method to be successfully applied in different system configurations can be measured through multiple tests on different systems.

Goal: Implementable with the current Semantic Web stack.
Questions: Can the method's steps that utilize Semantic Web concepts be implemented with current technology, or are improvements/extensions needed?
Metric: The method should be able to rely on existing Semantic Web technologies without the need to develop or improve their current status. Errors produced or modifications needed when implementing the proposed method can be considered a metric of how implementable the method currently is.

Goal: The method and its results should be semantically rich, allowing the description of high-level contexts and events along with their interrelationships.
Questions: Can the method describe arbitrary data? Can the method accept descriptions of high-level, user-defined concepts and associate sets of lower-level events with them? Can the method establish relationships between these higher-level descriptions?
Metric: The method should be able to accept user-defined high-level concepts and associate lower-level events with them using well-defined rules/restrictions. Errors produced or the inability to define custom events can be considered a metric of how semantically rich the method is.

Description of the Method

Design structure of the method

The Data Collection phase assumes proper acquisition techniques and possible pre-processing tasks.

Ontological representation of the data into the RDF data model, based on lightweight domain-specific ontologies.

Automated Reasoning for inferencing new axioms (class, property, inverse property assertion axioms).

Rule evaluation / integration with rule engines.

Integrated query against the set of asserted and inferred axioms.
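A minimal sketch of the automated-reasoning step (file names are hypothetical; the thesis implementation uses Pellet, while this sketch uses Jena's built-in OWL rule reasoner as a stand-in), which derives inferred axioms from the asserted RDF data and the domain ontologies:

    import com.hp.hpl.jena.rdf.model.InfModel;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.reasoner.Reasoner;
    import com.hp.hpl.jena.reasoner.ReasonerRegistry;

    public class InferenceExample {
        public static void main(String[] args) {
            Model asserted = ModelFactory.createDefaultModel();
            asserted.read("file:evidence.rdf");      // asserted evidence axioms (hypothetical file)
            asserted.read("file:ontology.owl");      // domain ontology (hypothetical file)
            Reasoner owlReasoner = ReasonerRegistry.getOWLReasoner();
            InfModel withInferences = ModelFactory.createInfModel(owlReasoner, asserted);
            withInferences.prepare();                // precompute inferences up-front
            // Subsequent queries against 'withInferences' see both asserted and inferred statements
            System.out.println("Statements (asserted + inferred): " + withInferences.size());
        }
    }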

Ontological Representation of Evidence

Two types of data:

Case-related data: storage media forensic images, network packet captures, firewall logs.

Supportive data: WHOIS domain information, IP geo-location, IP-to-ASN mappings, databases of malicious files or hosts.

Lightweight ontologies have been specified with the Protégé Ontology Editor for PCAP network captures, disk images, Windows XP firewall logs, the RIPE WHOIS database, VirusTotal, and the FIRE malicious networks tracker.

Ontologies

Network Capture: protocol stack reconstruction; focused on HTTP; W3C ERT RDF vocabulary for HTTP.

Forensic Disk Image: DFXML and fiwalk; timestamps, hash values, file type.

Ontologies

Windows XP Firewall Log: W3C Extended Log File Format.

RIPE WHOIS: RIPE NCC web interface; XML/JSON-formatted results.

Ontologies

Malicious Networks: FIRE project (WOMBAT EU FP7); aggregation from sources like Anubis, Wepawet, SpamCop, PhishTank; web interface (discontinued).

Malware Detection: VirusTotal provides a web interface to a variety of anti-malware engines; database search web interface based on hash values.

Semantic Integration of Evidence

URI Format: urn://<source_id>/<resource_ID>

Ontological representation: natively supported / semantic parsers.

De-duplication: single URI resource representation under the same namespace; owl:sameAs for the same resource under differently namespaced URIs.

OWL 2 hasKey; SWRL rules for integrating individuals in different ontologies.

Realistic (manual) approach: integration ontology (IP address, MD5 hash value).

Example: linking PacketCapture:IPAddress to WindowsXPFirewallLog:Host via the PcapIPToFWLogHost property.
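A minimal, hypothetical sketch of the owl:sameAs de-duplication idea (the URIs are illustrative, not taken from the thesis datasets): two URIs minted under different source namespaces for the same IP address are declared identical with Jena:

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Resource;
    import com.hp.hpl.jena.vocabulary.OWL;

    public class SameAsExample {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            // Hypothetical URIs: the same address seen in a packet capture and in a firewall log
            Resource pcapIp = model.createResource("urn://capture.pcap/ip_192.168.1.10");
            Resource fwHost = model.createResource("urn://firewall.log/host_192.168.1.10");
            model.add(pcapIp, OWL.sameAs, fwHost);   // assert that both URIs denote one resource
            model.write(System.out, "N-TRIPLE");
        }
    }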

Semantic Correlation of Evidence

Establishing relations between resources of different nature.

Temporal Correlation: SWRL Temporal Ontology (O'Connor & Das 2011); support for time instants and intervals. Two approaches:

Modify existing ontologies by importing the time ontology.

Specifying existing classes as subclasses of ‘ExtendedProposition’ in an external ontology.

Semantic Correlation of Evidence

Temporal Correlation (cont'd)

Relations between time intervals: Allen's Interval Algebra (Allen 1983).

Relations between time instants and intervals: ‘inside’, ‘before’, ‘after’ (Hobbs 2004).

SWRL built-ins.

Mereological Correlation

‘partOf’ relations; transitivity. E.g. IP address partOf IP range, IP range partOf AS => IP address partOf AS.
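A minimal sketch of this transitivity inference (the namespace and individual URIs are hypothetical; Jena's OWL rule reasoner stands in for the Pellet reasoner used in the thesis):

    import com.hp.hpl.jena.ontology.Individual;
    import com.hp.hpl.jena.ontology.ObjectProperty;
    import com.hp.hpl.jena.ontology.OntModel;
    import com.hp.hpl.jena.ontology.OntModelSpec;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.vocabulary.OWL;

    public class PartOfTransitivity {
        public static void main(String[] args) {
            String ns = "urn://example/ontology#";    // hypothetical namespace
            OntModel m = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_RULE_INF);
            ObjectProperty partOf = m.createTransitiveProperty(ns + "partOf");
            Individual ip    = m.createIndividual(ns + "ip_78.46.173.193", OWL.Thing);
            Individual range = m.createIndividual(ns + "range_78.46.0.0", OWL.Thing);
            Individual as    = m.createIndividual(ns + "AS24940", OWL.Thing);
            ip.addProperty(partOf, range);     // asserted: IP address partOf IP range
            range.addProperty(partOf, as);     // asserted: IP range partOf AS
            // Inferred by the transitivity of partOf:
            System.out.println("ip partOf as: " + m.contains(ip, partOf, as));
        }
    }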

Integrated Query Formulation and Evaluation

Two methods of query preparation: precomputing inferred axioms; back-propagation.

Two methods of query evaluation: merging ontologies; named graphs (distributed SPARQL endpoints).
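A minimal sketch of the ontology-merging evaluation strategy (the file names are hypothetical): the per-source RDF graphs are combined into one union model, which queries are then evaluated against exactly as in the earlier SPARQL sketch:

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class MergedModelExample {
        public static void main(String[] args) {
            // Hypothetical per-source serializations produced by the semantic parsers
            Model disk  = ModelFactory.createDefaultModel().read("file:disk_image.rdf");
            Model pcap  = ModelFactory.createDefaultModel().read("file:packet_capture.rdf");
            Model fwlog = ModelFactory.createDefaultModel().read("file:firewall_log.rdf");
            // Dynamic union: queries against 'merged' see the statements of all three graphs
            Model merged = ModelFactory.createUnion(ModelFactory.createUnion(disk, pcap), fwlog);
            System.out.println("Merged statements: " + merged.size());
        }
    }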

A Reference Implementation

Tools used: Java 6, Protégé 4.1.0, OWL API 3.2.4, Pellet 2.3.0, Protégé OWL API 3.4.8, Jena 2.6.4, Jess 7.1p2, Kraken Pcap API 1.3.0, Apache HTTP Components, Jsoup, JSON.

A Reference Implementation

• Evidence Manager: load evidence files.

• Semantic Parser: 6 parsers; filtering options (NIST NSRL) can lead to a 40-50% reduction of an XP image (see the sketch after this list).

• Collector Objects: reduce complexity; coupled with parsers.

• Inference Engine: class assertion, inverse property assertion.

• Integration Ontology: investigator-specific classes/properties.

• SWRL Rule Engine.

• SPARQL in-memory endpoint.
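A minimal, hypothetical sketch of the NSRL known-file filtering idea (the actual parser code is not part of this overview; the hash-list format and class name are assumptions): files whose MD5 hash appears in the NSRL reference set are skipped before being turned into RDF.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashSet;
    import java.util.Set;

    public class NsrlFilter {
        private final Set<String> knownHashes = new HashSet<String>();

        // Load a plain-text list of known-good MD5 hashes, one per line (assumed format)
        public NsrlFilter(String hashListPath) throws Exception {
            BufferedReader in = new BufferedReader(new FileReader(hashListPath));
            String line;
            while ((line = in.readLine()) != null) {
                knownHashes.add(line.trim().toLowerCase());
            }
            in.close();
        }

        // Files known to NSRL carry no evidentiary value and can be excluded from the graph
        public boolean shouldSkip(String md5) {
            return knownHashes.contains(md5.toLowerCase());
        }
    }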

Experimental Setup

2x HP Compaq 8000 Elite (Intel Core 2 Duo E8400, 4 GB RAM), running Microsoft Windows XP SP3 and BackTrack 5 R1.

MS11-006: vulnerability in Windows Shell graphics processing (Office documents in thumbnail mode).

Analysis workstation: Dell XPS 15 (Intel Core i7, 4 GB RAM).

Experiment A

Experiment B

Results (Experiment A)

CompromisedSystem.xml (Fiwalk output of the system’s disk image)
Original Disk Size: 25 GB
Original Fiwalk XML Output File Size: 9.46 MB
RDF/XML Serialization File Size: 7.08 MB
Number of Allocated Files in the Disk: 6610
Number of Nodes in the Graph Representation: 34012
Number of Edges in the Graph Representation: 83032

Network Packet Capture (filtered for the system’s IP address and TCP protocol only)
Original File Size: 454 KB
RDF/XML Serialization File Size: 662 KB
Number of TCP Sessions: 40
Number of Nodes in the Graph Representation: 1616
Number of Edges in the Graph Representation: 5891

Windows XP Firewall Log of the compromised system
Original File Size: 38 KB
RDF/XML Serialization File Size: 684 KB
Number of Log Entries: 413
Number of Nodes in the Graph Representation: 1344
Number of Edges in the Graph Representation: 5866

RIPE NCC WHOIS Database
RDF/XML Serialization File Size: 210 KB
Number of Queried IP Addresses: 37
Number of Nodes in the Graph Representation: 137
Number of Edges in the Graph Representation: 395

FIRE Malicious Networks Database
RDF/XML Serialization File Size: 113 KB
Number of Queried Autonomous Systems: 5
Number of Nodes in the Graph Representation: 384
Number of Edges in the Graph Representation: 1083

VirusTotal Anti-Malware Web Service
RDF/XML Serialization File Size: 2.45 MB
Number of Files Queried and Indexed by VT: 2304
Number of Nodes in the Graph Representation: 11519
Number of Edges in the Graph Representation: 18508

Results (Experiment A)

Reasoning Engine: 72130 inferred axioms (approx. 6.1 MB).

SWRL Engine: 160 ‘bridging’ properties.

PacketCapture:hasIPValue(?x,?y) ^
WindowsXPFirewallLog:hasAddress(?w,?z) ^
swrlb:stringEqualIgnoreCase(?y,?z)
-> IntegrationOntology:PcapIPToFWLogHost(?x,?w)

39610 time-related re-mapping properties:

DigitalMedia:File(?x) ^
DigitalMedia:hasFileModificationTime(?x,?y) ^
temporal:ValidInstant(?z) ^
temporal:hasTime(?z,?w) ^
swrlb:stringEqualIgnoreCase(?y,?w) ^
swrlx:makeOWLThing(?filemodificationevent,?x)
-> IntegrationOntology:FileModificationEvent(?filemodificationevent) ^
   IntegrationOntology:Event(?filemodificationevent) ^
   temporal:hasValidTime(?filemodificationevent,?z)

Results (Experiment B)

CompromisedSystem.xml (Fiwalk output of the system’s disk image)
Original Disk Size: 25 GB
Original Fiwalk XML Output File Size: 9.34 MB
RDF/XML Serialization File Size: 6.44 MB
Number of Allocated Files in the Disk: 3273
Number of Nodes in the Graph Representation: 16330
Number of Edges in the Graph Representation: 45039

Network Packet Capture (filtered for the system’s IP address and TCP protocol only)
Original File Size: 2.63 MB
RDF/XML Serialization File Size: 2 MB
Number of TCP Sessions: 57
Number of Nodes in the Graph Representation: 5419
Number of Edges in the Graph Representation: 21712

Windows XP Firewall Log of the compromised system
Original File Size: 46 KB
RDF/XML Serialization File Size: 784 KB
Number of Log Entries: 480
Number of Nodes in the Graph Representation: 1510
Number of Edges in the Graph Representation: 6794

RIPE NCC WHOIS Database
RDF/XML Serialization File Size: 38 KB
Number of Queried IP Addresses: 41
Number of Nodes in the Graph Representation: 181
Number of Edges in the Graph Representation: 326

FIRE Malicious Networks Database
RDF/XML Serialization File Size: 113 KB
Number of Queried Autonomous Systems: 5
Number of Nodes in the Graph Representation: 384
Number of Edges in the Graph Representation: 1083

VirusTotal Anti-Malware Web Service
RDF/XML Serialization File Size: 54 KB
Number of Files Queried and Indexed by VT: 2540
Number of Nodes in the Graph Representation: 253
Number of Edges in the Graph Representation: 386

Results (Experiment B)

Additional Temporal Rules: temporalBefore between time instants, time intervals, time instants and time periods, time periods and time instants; temporalStarts; temporalInside.

1024 ValidInstant individuals; 21 ValidPeriod individuals; 58854 inferred temporal relations.

Example Hypotheses - Queries

Hypothesis

The investigator hypothesizes that the compromised system may have had network communications with external IP addresses that belong to autonomous systems that may be listed as malicious networks.

Query:

SELECT ?tcpflow ?destipvalue ?netname ?asnumber ?host_fire
WHERE {
  ?tcpflow packetcapture:hasDestinationIP ?destip .
  ?destip packetcapture:hasIPValue ?destipvalue .
  ?destip integration:PcapIPToWHOISIpAddr ?whoisip .
  ?whoisip whois:isContainedInRange ?range .
  ?whoisip integration:WHOISIpAddrToFireIPAddr ?fireip .
  ?fireip fire:IPbelongsToHost ?host_fire .
  ?host_fire rdf:type fire:MaliciousHost .
  ?range whois:hasRange ?rangeValue .
  ?range whois:isContainedInAS ?as .
  ?as whois:hasNetName ?netname .
  ?as whois:hasASNumber ?asnumber .
  ?as whois:hasRoute ?route
}

Results (tcpflow, destipvalue, netname, asnumber):

<urn://bind_tcp_FWed_tcp.pcap#tcpSession_6>
"78.46.173.193"^^<http://www.w3.org/2001/XMLSchema#string>
"HETZNER-AS"^^<http://www.w3.org/2001/XMLSchema#string>
"24940"^^<http://www.w3.org/2001/XMLSchema#string>

<urn://bind_tcp_FWed_tcp.pcap#tcpSession_4>
"78.46.173.193"^^<http://www.w3.org/2001/XMLSchema#string>
"HETZNER-AS"^^<http://www.w3.org/2001/XMLSchema#string>
"24940"^^<http://www.w3.org/2001/XMLSchema#string>

Interpretation

The results of the query support the hypothesis that the compromised system indeed had network communications with IP addresses that belong to autonomous systems known to demonstrate malicious behavior. The query matches a graph pattern in the provided dataset, thus retrieving additional information about the specific blacklisted AS.

Evaluation

The method can be relevant to many different cases due to its ability to deal with heterogeneous data.

Ability to formulate complex and expressive queries over the integrated data that closely match logical hypotheses.

Efficient data abstraction and query evaluation, given axiom pre-inference.

Inverse object properties can considerably improve query evaluation time.

Evidence-neutral implementation.

Temporal correlation can be computationally demanding.

Evaluation

Reliance on online sources may affect the precision of the results.

Ontological consistency of the results given valid ontologies.

The implementation can be system-independent.

Ontologies can be dynamically expanded or new ones (case-specific) introduced.