semantically-enabled digital investigations - research overview
A METHOD FOR SEMANTIC INTEGRATION AND CORRELATION OF DIGITAL EVIDENCE USING A HYPOTHESIS-BASED APPROACH
Semantically-Enabled Digital Investigations
by Spyridon Dosis
February 2013, Stockholm
Problem Definition
Sophisticated attacks against highly interconnected networked systems.
Multitude, variety and size of data sources with possible evidentiary value.
Need for continuous state-of-the-art technical expertise.
Evidence-oriented first-generation forensic tools with poor integration and correlation features.
Lack of common, standardized data representation/abstraction formats.
Research Questions and Limitations
How can the Semantic Web technologies and the Linked Data initiative be applied to Digital Forensics?
How can a common ontology-based knowledge representation layer improve the level of integration of currently disjoint specialized areas of DF such as storage, network, mobile, live memory and others?
How may such a new method improve the efficiency and capabilities of existing DF investigation models, techniques and tools?
Not full coverage of the features and capabilities of the Semantic Web technologies.
Simplified complexity for the conducted experiments.
Digital Evidence
“any digital data that contain reliable information that supports or refutes a hypothesis about an incident” – (Carrier & Spafford 2004)
Continuously increasing scope
Varying layers of abstraction
(Schatz 2007) identifies 3 basic properties
Latency -> Semantic Interpretation
Fidelity -> Chain of Custody
Volatility -> Order of Volatility
Digital Investigations
The set of principles and methods that are followed during the lifecycle of digital evidence with the goal of event reconstruction.
Slight definition variations among different contexts.
The Event-based Digital Forensic Investigation Framework (Carrier & Spafford 2004): System Preservation, Evidence Searching, Event Reconstruction
The Digital Investigation Process (Casey 2004)
The Hypothesis-based Approach (Carrier 2006)
Semantic Web Technologies
“… information is given well-defined meaning, better enabling computers and people to work in cooperation” – (Tim Berners-Lee 2001)
Metadata – Annotation of data providing contextual or domain-specific information about the content
Ontology – “explicit and formal specification of a conceptualization” – (Gruber 1993) Entities, attributes, interrelationships
Open world assumption
Reasoning over data by inferencing implicit conclusions
Semantic Web Architecture : Part A
adapted from Antoniou & Van Harmelen 2004
• URI/IRI enables unique identification of a resource under a global scope.
• XML provides a consistent machine-consumable data encoding scheme in an unambiguous scoped manner.
• XML Schema is used for defining the rules and the ‘tag’ vocabulary that data must conform to.
• RDF provides a simple but flexible data model for encoding metadata
  • Subject-Predicate-Object
• RDF Schema is used for defining RDF vocabularies
  • Class and Property hierarchies
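The Subject-Predicate-Object model can be illustrated with a minimal sketch in plain Python, without any RDF library. The URNs and the `ex:` property names are hypothetical and chosen only for illustration; they are not the vocabulary used in the presented work.

```python
# Minimal sketch of the RDF subject-predicate-object model using plain tuples.
# All identifiers below are hypothetical examples, not terms from the talk.
TRIPLES = [
    ("urn://capture.pcap/tcpSession_1", "rdf:type", "ex:TCPSession"),
    ("urn://capture.pcap/tcpSession_1", "ex:hasDestinationIP", "urn://capture.pcap/ip_1"),
    ("urn://capture.pcap/ip_1", "ex:hasIPValue", "192.0.2.10"),
]


def match(triples, s=None, p=None, o=None):
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for subj, pred, obj in triples:
        if (s is None or subj == s) and (p is None or pred == p) and (o is None or obj == o):
            yield (subj, pred, obj)


# All objects of the (hypothetical) ex:hasDestinationIP property.
destinations = [obj for _, _, obj in match(TRIPLES, p="ex:hasDestinationIP")]
print(destinations)
```

A single triple pattern with wildcards is the basic building block that graph query languages combine into larger patterns.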
Semantic Web Architecture : Part B
adapted from Antoniou & Van Harmelen 2004
• OWL 2 is a computational logic-based language that enables automated reasoning for inferencing and consistency verification.
• Increased expressivity
  • Property Restrictions
  • Class and Property Equivalency
  • Property Relationships
  • Global Cardinality Constraints and Individual Identity (no unique-names assumption)
• OWL Dialects for varying levels of expressiveness and computational complexity.
• SWRL supports more advanced reasoning cases.
• SPARQL is an RDF-based query language and protocol
Previous Work #1
XML-based Approaches
Digital Forensics XML (Garfinkel 2009) for describing disk images and their contents (partitions, files, byte runs).
EDRM XML for describing electronic document metadata.
XIRAF for XML-based extraction, storage and querying of evidence files.
DEX for including provenance-related metadata.
Other domain-specific XML approaches for live forensics, network forensics, vulnerability assessment, logs, malware.
Support a level of tool interoperability and standardization.
No support for automated reasoning or semantic integration of data.
Previous Work #2
RDF-based Approaches
AFF forensic format uses RDF for including arbitrary metadata (system or process-related, user-specific ones)
Strengthening the chain-of-custody by additional RDF metadata (evidence-access, examiner or artifact-related information) (Giova 2011)
Ontological Approaches
FORE (Schatz 2004) comprised of a log parser, a forensic ontology and a custom rule language for aggregating lower level events into higher level ones. Later expanded by referencing external ontologies.
DIALOG conceptualized ‘procedural’ and ‘practical’ aspects of a digital investigation with practical examples of registry analysis. Later expanded with additional concepts for encoding forensically relevant types of data.
(Saad 2010) applied an ontology in the network forensics area for modeling network attacks and supporting different types of reasoning based on collected events
Methodology
Two main research paradigms in IT (Hevner 2004): Behavioural Science, Design Science
Outcomes of a design science process can be: Constructs, Models, Methods, Instantiations
Design Science Method
adapted from Johannesson & Perjons 2012
• Problem Specification
  • Literature Review
  • Case Studies
  • Empirical Observations
• Artifact Outline and Requirements
  • Literature Review
  • Case Studies
• Design and Development
• Artifact Demonstration
  • Laboratory Experiment (Simulated cases)
• Artifact Evaluation
  • “ex ante evaluation”
• Communication of the artifact
A Semantic Web approach for Digital Investigations
Information Integration
Common identifiers
Different identifiers
A Semantic Web approach for Digital Investigations
Semi-structured Data Support
Classification and Inference
Extensibility
Provenance
Named Graphs
Search
Relation to Digital Investigation Reference Models
• Conceptual Mapping between the Semantic Web architecture and digital investigation frameworks
• Previous phases are assumed as prerequisites
Evaluation Criteria
Goal – Question – Metric (GQM) approach
Generic Criteria
Goal: The proposed method should be appropriate for the task in hand.
Questions: What is the relationship of the proposed method with existing digital investigation practices and tools? What are the case context requirements for the method to be applied?
Metric: The ability of the method to handle different types of cases (network-related events, media device examination etc.), measured by the number of different data types it can process.
Goal: The method should provide good support for decision-making by providing relevant and usable results.
Questions: What types of new knowledge can the method extract, and how useful are they? How can the examiner formulate and evaluate hypotheses about the evidence files and receive informative results?
Metric: The ability of the method to support arbitrary queries and provide answers over the whole body of collected evidence. This can be quantified by the precision and recall information retrieval measures over the query results.
Goal: The method should be cost effective in terms of storage and time needs.
Questions: How does the method accept and store input data, intermediate and final results? What are the storage requirements for such an implementation? How much time is needed for applying the method on the input data, and how can it reduce the time that the investigation process takes?
Metric: Storage size requirements for representing input and output data. Time needed for performing the analysis of data or evaluating user-submitted queries.
Goal: The method should be flexible and scalable.
Questions: Can the method deal with new sources of data and seamlessly integrate new forms of ontologically-expressed knowledge and rules? Can the method support large amounts of data, and what problems may such complexity incur?
Metric: The ability of the method to process new data and accept additional ontologies or rules with minor (possibly no) modifications to the existing steps. It can be measured by the amount of configuration or code modifications such changes require. The method’s ability to handle large amounts of data can be measured by the input size in relation to the processing time or produced errors (e.g. number of captured network packets, firewall logs, disk image sizes etc.)
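The precision and recall measures named in the decision-support metric can be computed as in the following minimal sketch. The session identifiers are hypothetical, and treating an empty result set as precision 0.0 is a convention chosen here, not something stated in the source.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of query results.

    retrieved: items returned by the query
    relevant:  items the examiner considers correct answers (ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Convention (an assumption of this sketch): empty sets give 0.0, not an error.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the query returned three sessions, two of them relevant.
retrieved = {"tcpSession_4", "tcpSession_6", "tcpSession_9"}
relevant = {"tcpSession_4", "tcpSession_6"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```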
Evaluation Criteria
Forensic Criteria
Goal: The method’s results should be reproducible.
Questions: Do the results of the method behave deterministically when it is applied to the same input data, or are they inconsistent across multiple tests?
Metric: The method’s results (e.g. inferred axioms, query results) should be the same given the same dataset, independently of other factors such as the order of processing the evidence files. This can be measured by the number of errors or differing results after multiple applications of the method on the same dataset.
Goal: The method’s possible errors should be minimal and determined.
Questions: Does the method produce accurate results? Can the method accept inconsistent or malformed input data? How does the method deal with incomplete data? Can the method produce results that are ambiguous or inconsistent with the specified ontologies?
Metric: The method’s results can be automatically checked by a reasoning engine for possible inconsistencies between asserted and inferred axioms and the given ontologies. The method’s error rate can be measured by the error messages produced during its lifecycle.
Goal: The method must provide logging capabilities for the inclusion of arbitrary metadata regarding the case, the entities and the evidence objects involved.
Questions: Does the method support the addition of annotation axioms with respect to the asserted or inferred axioms? Does the method allow logging of its various steps as they are applied, together with the results they produce?
Metric: The ability to insert logging information during the method can be measured by its flexibility to accept arbitrary metadata.
Goal: The method should protect the integrity of the collected data.
Questions: Can the method operate on forensic copies of the collected evidence? Does the method use hashing algorithms in order to ensure the consistency and integrity of these forensic copies?
Metric: The method should protect the integrity of the collected data, files and devices throughout its whole lifecycle by being able to work on forensic copies instead of the originals and to verify any hash values that these copies carry as forensic metadata. The ability to perform these checks for different data sources can be considered as a metric.
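The integrity check described in the last criterion can be sketched as follows: recompute the digest of a forensic copy and compare it with the hash value recorded as forensic metadata. This is a minimal illustration using Python's standard `hashlib`; the function names, the file name and the choice of SHA-256 as default are assumptions of the sketch, not details of the reference implementation.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so that large disk images need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(path, recorded_hex, algorithm="sha256"):
    """Return True when the copy still matches the hash recorded at acquisition."""
    return file_digest(path, algorithm) == recorded_hex.strip().lower()


# Demonstration on a throwaway file standing in for a forensic copy.
with tempfile.TemporaryDirectory() as workdir:
    copy_path = os.path.join(workdir, "image.dd")  # hypothetical file name
    with open(copy_path, "wb") as handle:
        handle.write(b"example evidence bytes")
    recorded = hashlib.sha256(b"example evidence bytes").hexdigest()
    print(verify_copy(copy_path, recorded))
```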
Evaluation Criteria
Semantic Web Related Criteria
Goal: System Heterogeneity – Platform Independence
Questions: Can parts of the method be applied on different systems and the partial results later recombined? Are there any restrictions with respect to the configuration of these analysis systems?
Metric: The ability of the method to be successfully applied in different system configurations can be measured through multiple tests on different systems.
Goal: Implementable with the current Semantic Web Stack
Questions: Can the method’s steps that utilize Semantic Web concepts be implemented with current technology, or are other improvements/extensions needed?
Metric: The method should be able to rely on existing Semantic Web technologies without the need to develop or improve their current status. Errors produced or modifications needed when implementing the proposed method can be considered a metric of how implementable the method currently is.
Goal: The method and its results should be semantically rich, allowing the description of high level contexts and events along with their interrelationships.
Questions: Can the method describe arbitrary data? Can the method accept descriptions of high level and user-defined concepts and associate sets of lower level events with them? Can the method establish relationships between these higher level descriptions?
Metric: The method should be able to accept user-defined high level concepts and associate lower level events to them using well defined rules/restrictions. Errors produced or inability to define custom events can be considered as a metric of how semantically rich the method is.
Description of the Method
Design structure of the method
The Data Collection phase assumes proper acquisition techniques and possible pre-processing tasks.
Ontological representation of the data in the RDF data model, based on lightweight domain-specific ontologies.
Automated Reasoning for inferencing new axioms (class, property, inverse property assertion axioms).
Rule evaluation / integration with rule engines.
Integrated query against the set of asserted and inferred axioms.
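To illustrate one of the inference types listed above, the sketch below materializes inverse property assertions over a set of triples. It is a simplified stand-in for what an OWL reasoner does, and the property names are hypothetical; the source does not list the actual ontology terms.

```python
# Hypothetical inverse-property declaration (not taken from the source ontologies).
INVERSE_OF = {"ex:hasDestinationIP": "ex:isDestinationIPOf"}


def materialize_inverses(triples, inverse_of):
    """Return the inverse property assertions that are not yet asserted."""
    # Inverse declarations are symmetric, so index them in both directions.
    both_ways = dict(inverse_of)
    both_ways.update({inverse: prop for prop, inverse in inverse_of.items()})
    inferred = set()
    for subj, pred, obj in triples:
        if pred in both_ways:
            inferred.add((obj, both_ways[pred], subj))
    return inferred - set(triples)


asserted = {("ex:tcpSession_1", "ex:hasDestinationIP", "ex:ip_1")}
inferred = materialize_inverses(asserted, INVERSE_OF)
print(inferred)
```

Precomputing such assertions lets a query traverse a relation in either direction without reasoning at query time.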
Ontological Representation of Evidence
Two types of data
Case Related Data: Storage Media Forensic Images, Network Packet Captures, Firewall Logs
Supportive Data: WHOIS domain information, IP geo-location, IP to ASN mappings, databases of malicious files or hosts
Lightweight ontologies have been specified with the Protégé Ontology Editor based on PCAP Network Captures, Disk Images, Windows XP Firewall Logs, WHOIS RIPE Database, VirusTotal, FIRE malicious networks tracker
Ontologies
Network Capture: Protocol stack reconstruction; Focused on HTTP; W3C ERT RDF vocabulary for HTTP
Forensic Disk Image: DFXML and fiwalk; Timestamps, hash values, file type
Ontologies
Windows XP Firewall Log: W3C Extended Log File Format
RIPE WHOIS: RIPE NCC web interface; XML/JSON formatted results
Ontologies
Malicious Networks: FIRE project (Wombat EU FP7); Aggregation from sources like Anubis, Wepawet, SpamCop, PhishTank; Web interface (Discontinued)
Malware Detection: VirusTotal provides a web interface to a variety of antimalware engines; Database search web interface based on hash values
Semantic Integration of Evidence
URI Format: urn://<source_id>/<resource_ID>
Ontological representation: Natively supported / Semantic Parsers
De-duplication
Single URI resource representation under the same namespace
owl:sameAs for same resource / differently namespaced URIs
OWL 2 hasKey
SWRL rules for integrating individuals in different ontologies
Realistic (manual) approach: Integration ontology (IP address, MD5 hash value)
PacketCapture : IPAddress
WindowsXPFirewallLog : Host
PcapIPToFWLogHost
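The URI scheme and the owl:sameAs style of de-duplication can be sketched as follows. Minting follows the `urn://<source_id>/<resource_ID>` pattern from the slide; percent-encoding the two parts and grouping equivalent URIs with a union-find structure are choices made for this illustration, and all identifiers in the example are hypothetical.

```python
from urllib.parse import quote


def mint_uri(source_id, resource_id):
    """Build a URI of the form urn://<source_id>/<resource_ID>."""
    return "urn://{}/{}".format(quote(source_id, safe=""), quote(resource_id, safe=""))


def same_as_groups(pairs):
    """Group URIs linked by sameAs statements into equivalence classes."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# The same IP address seen in two evidence sources (hypothetical identifiers).
pcap_ip = mint_uri("capture.pcap", "ip_1")
fwlog_host = mint_uri("pfirewall.log", "host_7")
print(same_as_groups([(pcap_ip, fwlog_host)]))
```

Because sameAs is symmetric and transitive, chains of pairwise links collapse into a single group per real-world resource.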
Semantic Correlation of Evidence
Establishing relations between resources of different nature.
Temporal Correlation
SWRL Temporal Ontology (O’Connor & Das 2011)
Support for time instants and intervals
Two approaches
Modify existing ontologies by importing the time ontology.
Specifying existing classes as subclasses of ‘ExtendedProposition’ in an external ontology.
Semantic Correlation of Evidence
Temporal Correlation (Cont’d)
Relations between time intervals: Allen’s Interval Algebra (Allen 1983)
Relations between time instants and intervals: ‘inside’, ‘before’, ‘after’ (Hobbs 2004)
SWRL builtins
Mereological Correlation
‘partOf’ relations
Transitivity
E.g. IP address (partOf) IP range (partOf) AS => IP address (partOfAS) AS
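The transitivity of ‘partOf’ in the example above can be sketched as a simple fixed-point computation over pairs. The IP address, range and AS number are documentation-reserved placeholders, not values from the experiments, and a reasoner would derive the same facts from a transitive property declaration rather than from explicit code.

```python
def transitive_closure(pairs):
    """Return all (x, z) implied by chaining (x, y) and (y, z) pairs."""
    closure = set(pairs)
    while True:
        derived = {
            (a, d)
            for a, b in closure
            for c, d in closure
            if b == c
        } - closure
        if not derived:
            return closure
        closure |= derived


# IP address partOf IP range, IP range partOf AS (placeholder values).
part_of = {
    ("192.0.2.10", "192.0.2.0/24"),
    ("192.0.2.0/24", "AS64496"),
}
closed = transitive_closure(part_of)
print(("192.0.2.10", "AS64496") in closed)
```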
Integrated Query Formulation and Evaluation
Two methods of query preparation: Precomputing inferred axioms, Back-propagation
Two methods of query evaluation: Merging ontologies, Named graphs (Distributed SPARQL Endpoints)
A Reference Implementation
Tools Used: Java 6, Protégé 4.1.0, OWL API 3.2.4, Pellet 2.3.0, Protégé OWL API 3.4.8, Jena 2.6.4, Jess 7.1p2, Kraken Pcap API 1.3.0, Apache HTTP Components, Jsoup, JSON
A Reference Implementation
• Evidence Manager
  • Load evidence files
• Semantic Parser
  • 6 parsers
  • Filtering options (NIST NSRL) can lead to 40-50% reduction of an XP image.
• Collector Objects
  • Reduce complexity
  • Coupled with parsers
• Inference Engine
  • Class Assertion
  • Inverse Property Assertion
• Integration Ontology
  • Investigator-specific classes/properties
• SWRL Rule Engine
• SPARQL In-memory endpoint
Experimental Setup
2x HP Compaq 8000 Elite: Intel Core 2 Duo E8400 Processor, 4 GB RAM
Microsoft XP SP 3
Backtrack 5 R1
MS11-006 Vulnerability in Windows Shell Graphics Processing
Office documents in thumbnail mode
Analysis Workstation: Dell XPS 15, Intel Core i7, 4GB RAM
Results (Experiment A)
CompromisedSystem.xml (Fiwalk output of the system’s disk image)
Original Disk Size: 25GB
Original Fiwalk XML output File Size: 9,46MB
RDF/XML Serialization File Size: 7,08MB
Number of Allocated Files in the Disk: 6610
Number of Nodes in the Graph Representation: 34012
Number of Edges in the Graph Representation: 83032
Network Packet Capture (filtered for the system’s IP address and TCP protocol only)
Original File Size: 454KB
RDF/XML Serialization File Size: 662KB
Number of TCP sessions: 40
Number of Nodes in the Graph Representation: 1616
Number of Edges in the Graph Representation: 5891
Windows XP Firewall Log of the compromised system
Original File Size: 38KB
RDF/XML Serialization File Size: 684KB
Number of Log Entries: 413
Number of Nodes in the Graph Representation: 1344
Number of Edges in the Graph Representation: 5866
RIPE NCC WHOIS Database
RDF/XML Serialization File Size: 210KB
Number of Queried IP Addresses: 37
Number of Nodes in the Graph Representation: 137
Number of Edges in the Graph Representation: 395
FIRE Malicious Networks Database
RDF/XML Serialization File Size: 113KB
Number of Queried Autonomous Systems: 5
Number of Nodes in the Graph Representation: 384
Number of Edges in the Graph Representation: 1083
VirusTotal Anti-Malware Web Service
RDF/XML Serialization File Size: 2,45MB
Number of Queried and Indexed by VT Files: 2304
Number of Nodes in the Graph Representation: 11519
Number of Edges in the Graph Representation: 18508
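For the three sources where both an original and an RDF/XML size are reported, the storage overhead of the RDF representation can be expressed as a simple ratio. The sketch below only re-uses the figures from the table above (decimal commas written as decimal points); the dictionary layout and names are illustrative.

```python
# (original size, RDF/XML size) in the same unit per row, copied from the table.
SIZES = {
    "Fiwalk XML (disk image), MB": (9.46, 7.08),
    "Network packet capture, KB": (454, 662),
    "Windows XP firewall log, KB": (38, 684),
}


def expansion_ratio(original, rdf):
    """RDF/XML size divided by original size; above 1 means the data grew."""
    return rdf / original


ratios = {
    name: round(expansion_ratio(original, rdf), 2)
    for name, (original, rdf) in SIZES.items()
}
for name, ratio in ratios.items():
    print(name, ratio)
```

A ratio below 1 for the Fiwalk output and well above 1 for the firewall log suggests that the overhead depends strongly on how compact the original format is.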
Results (Experiment A)
Reasoning Engine: 72130 inferred axioms (approx. 6.1 MB)
SWRL Engine
160 ‘bridging’ properties
PacketCapture:hasIPValue(?x,?y) ^ WindowsXPFirewallLog:hasAddress(?w,?z) ^ swrlb:stringEqualIgnoreCase(?y,?z) -> IntegrationOntology:PcapIPToFWLogHost(?x,?w)
39610 time-related re-mapping properties
DigitalMedia:File(?x) ^ DigitalMedia:hasFileModificationTime(?x,?y) ^ temporal:ValidInstant(?z) ^ temporal:hasTime(?z,?w) ^ swrlb:stringEqualIgnoreCase(?y,?w) ^ swrlx:makeOWLThing(?filemodificationevent,?x) -> IntegrationOntology:FileModificationEvent(?filemodificationevent) ^ IntegrationOntology:Event(?filemodificationevent) ^ temporal:hasValidTime(?filemodificationevent,?z)
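The effect of the first ‘bridging’ rule (PcapIPToFWLogHost) can be approximated outside a rule engine as a case-insensitive join on the address string. This is a sketch of the rule's logic only, not of how the SWRL engine evaluates it; the function name and the sample individuals are hypothetical.

```python
from collections import defaultdict


def bridge_pcap_to_fwlog(pcap_ip_values, fwlog_addresses):
    """Pair packet-capture IP individuals with firewall-log hosts.

    pcap_ip_values:  iterable of (ip_individual, value), as in hasIPValue(?x, ?y)
    fwlog_addresses: iterable of (host_individual, value), as in hasAddress(?w, ?z)
    Returns the set of (ip_individual, host_individual) pairs whose values are
    equal ignoring case, mirroring swrlb:stringEqualIgnoreCase(?y, ?z).
    """
    hosts_by_address = defaultdict(list)
    for host, address in fwlog_addresses:
        hosts_by_address[address.casefold()].append(host)

    bridged = set()
    for ip, value in pcap_ip_values:
        for host in hosts_by_address.get(value.casefold(), []):
            bridged.add((ip, host))
    return bridged


pcap = [("pcap:ip_1", "192.0.2.10"), ("pcap:ip_2", "2001:DB8::1")]
fwlog = [("fw:host_7", "192.0.2.10"), ("fw:host_9", "2001:db8::1"), ("fw:host_3", "198.51.100.5")]
print(sorted(bridge_pcap_to_fwlog(pcap, fwlog)))
```

Indexing one side by the normalized value avoids comparing every pair of individuals, which is what a naive evaluation of the rule body would do.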
Results (Experiment B)
CompromisedSystem.xml (Fiwalk output of the system’s disk image)
Original Disk Size: 25GB
Original Fiwalk XML output File Size: 9,34MB
RDF/XML Serialization File Size: 6,44MB
Number of Allocated Files in the Disk: 3273
Number of Nodes in the Graph Representation: 16330
Number of Edges in the Graph Representation: 45039
Network Packet Capture (filtered for the system’s IP address and TCP protocol only)
Original File Size: 2,63MB
RDF/XML Serialization File Size: 2MB
Number of TCP sessions: 57
Number of Nodes in the Graph Representation: 5419
Number of Edges in the Graph Representation: 21712
Windows XP Firewall Log of the compromised system
Original File Size: 46KB
RDF/XML Serialization File Size: 784KB
Number of Log Entries: 480
Number of Nodes in the Graph Representation: 1510
Number of Edges in the Graph Representation: 6794
RIPE NCC WHOIS Database
RDF/XML Serialization File Size: 38KB
Number of Queried IP Addresses: 41
Number of Nodes in the Graph Representation: 181
Number of Edges in the Graph Representation: 326
FIRE Malicious Networks Database
RDF/XML Serialization File Size: 113KB
Number of Queried Autonomous Systems: 5
Number of Nodes in the Graph Representation: 384
Number of Edges in the Graph Representation: 1083
VirusTotal Anti-Malware Web Service
RDF/XML Serialization File Size: 54KB
Number of Queried and Indexed by VT Files: 2540
Number of Nodes in the Graph Representation: 253
Number of Edges in the Graph Representation: 386
Results (Experiment B)
Additional Temporal Rules
temporalBefore between: Time Instants; Time Intervals; Time Instants and Time Periods; Time Periods and Time Instants
temporalStarts
temporalInside
1024 ValidInstant individuals
21 ValidPeriod individuals
58854 inferred temporal relations
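The temporal relations named above can be sketched over plain Python datetimes. The definitions follow the usual readings of Allen's ‘before’ and ‘starts’ and of an instant being ‘inside’ an interval; treating the interval boundaries as strict is an assumption of this sketch and may differ from the SWRL temporal built-ins. The timestamps are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class Period(NamedTuple):
    """A time period with a start and an end instant."""
    start: datetime
    end: datetime


def _start(t):
    return t.start if isinstance(t, Period) else t


def _end(t):
    return t.end if isinstance(t, Period) else t


def temporal_before(a, b):
    """a ends strictly before b starts; a and b may be instants or periods."""
    return _end(a) < _start(b)


def temporal_starts(a, b):
    """Period a begins together with period b and ends earlier."""
    return a.start == b.start and a.end < b.end


def temporal_inside(instant, period):
    """The instant falls strictly between the period's boundaries."""
    return period.start < instant < period.end


# Hypothetical timestamps for illustration.
file_modified = datetime(2012, 1, 1, 10, 12)
tcp_session = Period(datetime(2012, 1, 1, 10, 5), datetime(2012, 1, 1, 10, 30))
log_entry = datetime(2012, 1, 1, 10, 0)

print(temporal_before(log_entry, tcp_session))
print(temporal_inside(file_modified, tcp_session))
```

Evaluating such relations for every pair of instants and periods grows quadratically with the number of individuals, which is consistent with the large number of inferred temporal relations reported above.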
Example Hypotheses - Queries
Hypothesis
The investigator hypothesizes that the compromised system may have had network communications with external IP addresses that belong to autonomous systems that may be listed as malicious networks.
Query
SELECT ?tcpflow ?destipvalue ?netname ?asnumber ?host_fire
WHERE {
  ?tcpflow packetcapture:hasDestinationIP ?destip .
  ?destip packetcapture:hasIPValue ?destipvalue .
  ?destip integration:PcapIPToWHOISIpAddr ?whoisip .
  ?whoisip whois:isContainedInRange ?range .
  ?whoisip integration:WHOISIpAddrToFireIPAddr ?fireip .
  ?fireip fire:IPbelongsToHost ?host_fire .
  ?host_fire rdf:type fire:MaliciousHost .
  ?range whois:hasRange ?rangeValue .
  ?range whois:isContainedInAS ?as .
  ?as whois:hasNetName ?netname .
  ?as whois:hasASNumber ?asnumber .
  ?as whois:hasRoute ?route
}
Results (columns: tcpflow, destipvalue, netname, asnumber)
<urn://bind_tcp_FWed_tcp.pcap#tcpSession_6>, "78.46.173.193"^^<http://www.w3.org/2001/XMLSchema#string>, "HETZNER-AS"^^<http://www.w3.org/2001/XMLSchema#string>, "24940"^^<http://www.w3.org/2001/XMLSchema#string>
<urn://bind_tcp_FWed_tcp.pcap#tcpSession_4>, "78.46.173.193"^^<http://www.w3.org/2001/XMLSchema#string>, "HETZNER-AS"^^<http://www.w3.org/2001/XMLSchema#string>, "24940"^^<http://www.w3.org/2001/XMLSchema#string>
Interpretation
The results of the query support the hypothesis that the compromised system did have network communications with IP addresses that belong to autonomous systems known to demonstrate malicious behavior. The query matches a graph pattern in the provided dataset, thereby retrieving additional information about the specific blacklisted AS.
Evaluation
The method can be relevant to many different cases due to its ability to deal with heterogeneous data.
Ability to formulate complex and expressive queries over the integrated data that closely match logical hypotheses
Efficient data abstraction and query evaluation, given axiom pre-inference
Inverse object properties can considerably improve query evaluation time
Evidence-neutral implementation
Temporal correlation can be computationally demanding