Lessons Learned in digital forensics


DESCRIPTION

This presentation discusses lessons learned in the development of digital forensics tools.

TRANSCRIPT

Page 1: Digital forensics lessons

Lessons Learned in digital forensics

Page 2: Digital forensics lessons

Writing digital forensics (DF) tools is difficult because of the diversity of data types that need to be processed, the need for high performance, the skill set of most users, and the requirement that the software run without crashing. Developing this software is dramatically easier when one possesses a few hundred disks of other people's data for testing purposes. This paper presents some of the lessons learned by the author over the past 14 years developing DF tools and maintaining several research corpora that currently total roughly 30TB.

Abstract

Page 3: Digital forensics lessons

As the field of digital forensics (DF) continues to grow, many people find themselves engaged in the once obscure practice of writing DF software. Few of today's forensic tool developers have formal training in software development or design; many do not even see themselves as programmers, saying that they are writing "scripts", not programs.

Introduction

Page 4: Digital forensics lessons

Meaning of digital forensics software

Software that is used to analyze disk images, memory dumps, network packet captures, program executables, office documents, web pages, and container files.

Page 5: Digital forensics lessons

1-Criminal investigations.
2-Internal investigations.
3-Audits.

All of which have different standards for chain-of-custody, admissibility, and scientific validity.

The use of DF tools

Page 6: Digital forensics lessons

Hackers hide data in several ways

In images, using image watermarking and steganography techniques, though these can be caught by artifacts, copy-forgery detection techniques, and analysis of data patterns. In unallocated partitions on the hard disk and remapped bad sectors, or using an alternate data stream (ADS), e.g. C:\> notepad file.txt:hide (any data hidden in the stream is not reflected in the file size).
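
As a rough illustration of how simple ADS hiding is, the Python sketch below writes and then reads back a hidden stream (Windows/NTFS only; the file and stream names are made up for the example):

    # ads_demo.py -- hide and recover data in an NTFS alternate data
    # stream. Windows/NTFS only; "hide" is an arbitrary stream name.
    import os

    with open("file.txt", "w") as f:
        f.write("visible content")        # what dir and Explorer report

    with open("file.txt:hide", "w") as f: # same file, hidden stream
        f.write("secret content")

    print(os.path.getsize("file.txt"))    # size of the main stream only
    with open("file.txt:hide") as f:
        print(f.read())                   # prints "secret content"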

In order to delete files securely for good, you need to overwrite the data, for example with the Gutmann algorithm's 35 passes of patterned and random writes.
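
A minimal sketch of multi-pass overwriting in the spirit of the Gutmann method (simplified: it writes random bytes on every pass rather than Gutmann's specific 35-pattern sequence, and on SSDs or journaling file systems a file-level overwrite may not actually destroy the data):

    # secure_delete.py -- simplified multi-pass overwrite. The real
    # Gutmann method uses 35 specific patterns; this sketch writes
    # random bytes each pass. Wear leveling and journaling can keep
    # old copies alive, so treat this as an illustration only.
    import os

    def secure_delete(path, passes=35):
        size = os.path.getsize(path)
        with open(path, "r+b") as f:
            for _ in range(passes):
                f.seek(0)
                f.write(os.urandom(size))
                f.flush()
                os.fsync(f.fileno())   # force each pass to the device
        os.remove(path)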

Page 7: Digital forensics lessons

Distinct Sector Hashes for Target file detection

Hashing files to check for file changes.
Hashing sectors to discover changes in file segments.
Hashing relies on sampling probability, so the whole drive need not be hashed, which would take too much processing time.
Looking for distinct hashes and repeated file patterns against known-hash collections (e.g. Government data sets, Openmalware) to detect known malware.
The sampling algorithm is an urn-statistics problem: finding the sectors that need to be inspected is like finding the red beans among the red and black beans in an urn.
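
A minimal sketch of sampled sector hashing, assuming a raw disk image and a pre-built set of hashes of target-file sectors (both the image path and the hash set are placeholders):

    # sector_sample.py -- randomly sample sectors from a raw image and
    # check their hashes against a known-target set (urn-style sampling).
    import hashlib
    import random

    SECTOR = 512

    def sample_image(path, known_hashes, n_samples=10000):
        """Return (sector number, hash) pairs that match known targets."""
        hits = []
        with open(path, "rb") as img:
            img.seek(0, 2)                      # find image size
            n_sectors = img.tell() // SECTOR
            picks = random.sample(range(n_sectors),
                                  min(n_samples, n_sectors))
            for sector in picks:
                img.seek(sector * SECTOR)
                digest = hashlib.md5(img.read(SECTOR)).hexdigest()
                if digest in known_hashes:
                    hits.append((sector, digest))
        return hits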

Page 8: Digital forensics lessons

Finding distinct and repeated hashes in hard disk sectors

Page 9: Digital forensics lessons

Using different data structures for the stored hashes and testing their lookup speed against the file system.
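
The slide does not say which data structures were compared; as one plausible illustration, the sketch below times membership tests on a Python set against a simple Bloom filter, a structure commonly used for very large hash databases:

    # structure_bench.py -- compare a set with a naive Bloom filter for
    # hash-membership tests. Illustrative only: real hash databases are
    # disk-backed and much larger than RAM.
    import hashlib
    import os
    import time

    class BloomFilter:
        def __init__(self, bits=1 << 24, k=4):
            self.bits, self.k = bits, k
            self.array = bytearray(bits // 8)

        def _positions(self, item):
            for i in range(self.k):        # k independent hash positions
                digest = hashlib.sha256(bytes([i]) + item).digest()
                yield int.from_bytes(digest[:8], "big") % self.bits

        def add(self, item):
            for p in self._positions(item):
                self.array[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):      # may rarely false-positive
            return all(self.array[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    stored = [os.urandom(16) for _ in range(100000)]
    probes = [os.urandom(16) for _ in range(100000)]
    bloom, plain = BloomFilter(), set()
    for h in stored:
        bloom.add(h)
        plain.add(h)
    for name, db in (("set", plain), ("bloom", bloom)):
        start = time.perf_counter()
        hits = sum(p in db for p in probes)
        print(name, hits, "%.3fs" % (time.perf_counter() - start))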

Page 10: Digital forensics lessons

Network forensics challenges: cloud computing introduces challenges that need new tools.

New frontiers in network intrusion, starting from the firewall. Emerging network forensic areas:

Social networks
Data mining
Digital imaging and data visualization

Network forensics

Page 11: Digital forensics lessons

Applying network forensics in critical infrastructures

Botnets.
Wireless networks: still lacking good forensic tools.

Sink holes: accept, analyze, and forensically store attack traffic.

Page 12: Digital forensics lessons

Install forensic tools at layers 0-2.

SCADA (Supervisory Control and Data Acquisition) challenges

Page 13: Digital forensics lessons

Smart phone security challenges

A smartphone threat model shows malware spreading from the application layer to the communication layer and finally to the resource layer, where the malware hijacks the phone's resources and sends multimedia messages to premium accounts.

Page 14: Digital forensics lessons

The challenge of data diversity:
1-Processing incomplete or corrupt data.
2-Why will data not validate?
3-Windows inconsistencies.
4-Eliminating data that are consistent.

Data scale challenges:
1-The amount of data.
2-Applying big data solutions to DF.

Lessons in digital forensics

Page 15: Digital forensics lessons

Sub-linear algorithms for reading sectors

One solution to the performance bottleneck is to adopt sub-linear algorithms that operate by sampling data. Sampling is a powerful technique and can frequently find select data on a large piece of media with a high degree of precision.

But sampling cannot prove the absence of data: the only way to establish that there are no written sectors on a hard drive is to read every sector.
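
A minimal sketch of the sampling idea: estimate how much of a drive image is non-blank from a random sample of sectors. If a fraction p of the sectors hold data, a sample of n sectors misses all of them with probability roughly (1 - p)^n, so even modest samples find data quickly; per the caveat above, though, a clean sample never proves the drive is blank:

    # null_sample.py -- estimate the fraction of non-blank sectors by
    # random sampling. A clean sample suggests, but cannot prove, that
    # the drive is blank; proof requires reading every sector.
    import random

    SECTOR = 512
    BLANK = bytes(SECTOR)

    def estimate_nonblank(path, n_samples=10000):
        with open(path, "rb") as img:
            img.seek(0, 2)
            n_sectors = img.tell() // SECTOR
            picks = random.sample(range(n_sectors),
                                  min(n_samples, n_sectors))
            nonblank = 0
            for sector in picks:
                img.seek(sector * SECTOR)
                if img.read(SECTOR) != BLANK:
                    nonblank += 1
        return nonblank / len(picks)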

Page 16: Digital forensics lessons

Temporal diversity: the never-ending upgrade cycle

Many computer users have learned that upgrades are a disruptive process that needs to be carefully managed. As a result, many organizations run out-of-date operating systems and only move to newer ones when they buy new hardware.
1-Upgrading forensics tools
2-Software versions to be upgraded
3-EnCase forensics tool
4-Intelligent forensics tools

Page 17: Digital forensics lessons

Human capital demands and limitations

1-It was found that users of DF software come overwhelmingly from law enforcement, with little or no background in computer science. They are generally deadline-driven and over-worked.
2-Examiners who have substantial knowledge in one area (e.g., NTFS semantics) will routinely encounter problems requiring knowledge of other areas (e.g., Macintosh malware) for which they have no training.
3-Developers likewise need skills such as opcodes, multi-threading, the organization of processes and operating system data structures, networking, and supercomputing.

Page 18: Digital forensics lessons

Hard to recover data in reality; hard to recover data from a hard disk: recovering data from hard drives typically involves decoding data that is fragmented or partially overwritten.

Funding problems: the differences between Windows Explorer and EnCase Forensic are not obvious to the uninitiated. DF is a difficult process that looks easy. This is not a good formula for continued funding.

The CSI Effect

Page 19: Digital forensics lessons

Lessons learned managing a research corpus

This project started in 1998 and has expanded to include data from hard drives, cell phones, digital cameras, and other devices. Today the corpus includes nearly a million redistributable files downloaded from US Government web servers, disk images from thousands of hard drives purchased around the world, and several terabytes of "realistic" scenarios manufactured by students.

Page 20: Digital forensics lessons

Corpus management: technical issues

1-Imaging ATA drives. Lesson: read the documentation for the computer that you are using. Lesson: make the most of the tools that you have and follow the technical innovations they force upon you. (You are dealing with hard disks built on different technologies, whether stream-data processing or bulk-data processing, for compression, reading file fragments, and data segments.)

Page 21: Digital forensics lessons

2-Automation as the key to corpus management

Needed a process for capturing the hard disk make, model, and serial number. Lesson: automation is key; any process that involves manual record keeping is going to introduce inaccuracies that will be hard to detect and correct. Lesson: useful data will outlive the system in which it is stored, so make provisions to move the data when you design the system.
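
One way to automate that capture: the sketch below pulls the model and serial number from smartctl (part of smartmontools) and records them in SQLite. The field labels ("Device Model:", "Serial Number:") vary by drive and interface, so treat the parsing as an assumption:

    # catalog_drive.py -- record drive model/serial automatically rather
    # than by hand. Assumes smartmontools is installed; smartctl's field
    # labels differ across drives, so the lookups below may need tweaks.
    import sqlite3
    import subprocess
    import sys

    def drive_info(device):
        out = subprocess.run(["smartctl", "-i", device],
                             capture_output=True, text=True).stdout
        fields = {}
        for line in out.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        return fields.get("Device Model"), fields.get("Serial Number")

    model, serial = drive_info(sys.argv[1])          # e.g. /dev/sdb
    db = sqlite3.connect("corpus.db")
    db.execute("CREATE TABLE IF NOT EXISTS drives (model TEXT, serial TEXT)")
    db.execute("INSERT INTO drives VALUES (?, ?)", (model, serial))
    db.commit()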

Page 22: Digital forensics lessons

3-Evidence file formats (custom container files)

Trying to use his own container files did not work well, and he had to use standard containers from programs like EnCase and FTK.

Lesson: avoid developing new file formats whenever possible. Lesson: kill your darlings.

4-Crashes from bad drives

Causes of crashes are many: kernel memory could be overwritten, the drive could be faulty, or data could be transferred to incorrect memory locations.

Lesson: many technical options remain unexplored.

Page 23: Digital forensics lessons

5- Drive failures produce better data

Algorithm 1: developed an algorithm that reads from the first sector of the disk to the last and, upon encountering an error, jumps to the last sector of the hard drive and then repeatedly skips toward the front of the drive, reads a few sectors, and repeats. This works for a single error but not for multiple errors.

Algorithm 2: developed a disk imaging program called aimage which implemented a variety of recovery algorithms, such as attempting to repeatedly reread the problematic section; randomly seeking and reading; jumping ahead a few hundred kilobytes at each error; and reading from the last sector toward the first.
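
A minimal sketch of just one of those strategies, jumping ahead a fixed distance at each read error (an illustration, not aimage's actual code):

    # skip_ahead.py -- image a failing drive by skipping past each read
    # error, one of several strategies aimage combined. Unread regions
    # stay as zeros in the (sparse) output image.
    import os

    CHUNK = 65536
    SKIP = 512 * 1024            # jump this far past each bad region

    def image_with_skips(device, out_path, size):
        in_fd = os.open(device, os.O_RDONLY)
        with open(out_path, "wb") as out:
            out.truncate(size)   # sparse zeros for anything never read
            pos = 0
            while pos < size:
                try:
                    os.lseek(in_fd, pos, os.SEEK_SET)
                    data = os.read(in_fd, min(CHUNK, size - pos))
                    out.seek(pos)
                    out.write(data)
                    pos += len(data) or CHUNK
                except OSError:  # bad sector: give up on this region
                    pos += SKIP
        os.close(in_fd)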

Page 24: Digital forensics lessons

Lessons learned

Lesson: Drives with some bad sectors invariably have more sensitive information on them than drives that were in working condition when they were decommissioned.

Lesson: do research, and only maintain software that implements a particular function when no other software is available.

Page 25: Digital forensics lessons

6- Numbering and naming

Algorithm 1: developed an algorithm that generated file names randomly, but it was a waste of time. Lesson: names must be short enough to be usable but long enough to be distinct.

When I started acquiring data outside the US I discovered that the country of origin was the most important characteristic of a disk image. I adopted a naming scheme in which the first two characters are the ISO country code, followed by a two-digit batch number, a dash, and a four-digit item number. (For example, CN07-0045 is the 45th disk of the 7th batch acquired from China.) Assigning a batch number allows different individuals in the same country to assign their own numbers.
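
A small sketch of validating and parsing such identifiers (the regular expression is mine, inferred from the description above):

    # corpus_id.py -- parse corpus image names such as CN07-0045:
    # two-letter ISO country code, two-digit batch, four-digit item.
    import re

    ID_RE = re.compile(r"^([A-Z]{2})(\d{2})-(\d{4})$")

    def parse_id(name):
        m = ID_RE.match(name)
        if m is None:
            raise ValueError("not a corpus ID: %r" % name)
        country, batch, item = m.groups()
        return country, int(batch), int(item)

    print(parse_id("CN07-0045"))   # ('CN', 7, 45)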

Lesson: although it is advantageous to have names that contain no semantic content, it is significantly easier to work with names that have some semantic meaning.

Page 26: Digital forensics lessons

7- Path names

Lesson: place access-control information as near to the root of a path name as possible.

Page 27: Digital forensics lessons

8- Anti-virus and indexing

Lesson: Configure anti-virus scanners and other indexing tools to ignore directories that might contain raw forensic data.

9- Distribution and updates

Lesson: solutions developed by other disciplines for distributing large files rarely work well when applied to DF without substantial reworking.

Page 28: Digital forensics lessons

Corpus management: policy issues

1- Privacy issues. Lesson: just because something is legal, you may wish to think twice before you do it.

2- Illegal content: financial data, passwords, and copyright. Lesson: never sell access to DF data, even if you have personal ownership. Lesson: understand copyright law before copying other people's data. Lesson: make sure your intent is scientific research, not fraud, so that any collection of access devices you create does not constitute criminal activity (credit card fraud).

3- Illegal content: pornography. Lesson: do not give minors access to real DF data; do not intentionally extract pornography from research corpora.

4- Institutional Review Boards. Lesson: while IRBs exist to protect human subjects, many have expanded their role to protect institutions and experimenters. Unfortunately this expanded role occasionally decreases the protection afforded human subjects. And even with the IRB watching over you, it's important to watch your back.

Page 29: Digital forensics lessons

Lessons learned developing DF tools

1- Platform and language
2- Parallelism and high performance computing
3- All-in-one tools vs. single-use tools
4- Evidence container file formats

Page 30: Digital forensics lessons

1- Platform and language

1- The easiest way to write multi-platform tools is to write command-line programs in C, C++, C#, Java, or Python, as programs written in these languages can easily transfer between the three platforms (Windows, Linux, Mac OS).
2- Although C has historically been the DF developer's language of choice, we have shifted to C++ so that we can use the STL collection and container classes.
3- Java has a reputation for being slow, especially for highly computational applications.
4- While it is easy to write programs in Python, experience to date has shown that these programs are slow and memory-intensive.

Page 31: Digital forensics lessons

2-Parallelism and high performance computing

Multithreading and high performance computing do not always work well because of communication bottlenecks, and many times the host computer's processor is better than GPUs due to I/O bottlenecks, especially when processing data at many gigabytes per second.

Page 32: Digital forensics lessons

3- All-in-one tools vs. single-use tools

My experience argues that it is better to have a single tool than many: if there are many tools, most investigators will want to have them all. Splitting functionality into multiple tools complicates tool management without providing any real benefit to practitioners. Much of what a DF tool does (data ingest, decoding and enumerating data structures, preparing a report) is required no matter what kind of output is desired. There is a finite cost to packaging, distributing, and promoting a tool. When a tool has many functions this cost is amortized across a wider base.

Page 33: Digital forensics lessons

4- Evidence container file formats

1-Processing inputs in any format: tools should be able to process inputs in any format and transparently handle disk images in raw, split-raw, EnCase, or AFF formats.
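
A sketch of the kind of transparent dispatch meant here. It keys on file extension for brevity; a real tool would also check magic bytes (EnCase E01 images begin with the signature "EVF") and would hand E01 and AFF images to real libraries such as libewf and AFFLIB:

    # open_image.py -- dispatch on evidence-container type so the rest
    # of the tool sees one interface. Extension-based for brevity; real
    # tools also check file signatures.
    import os

    class SplitRawReader:
        """Collects split-raw segments (.001, .002, ...) in order."""
        def __init__(self, first_segment):
            base = first_segment[:-4]
            self.parts, n = [], 1
            while os.path.exists("%s.%03d" % (base, n)):
                self.parts.append("%s.%03d" % (base, n))
                n += 1

    def open_image(path):
        ext = os.path.splitext(path)[1].lower()
        if ext in (".raw", ".dd", ".img"):
            return open(path, "rb")        # raw image: plain file reads
        if ext == ".001":
            return SplitRawReader(path)    # split raw across segments
        if ext == ".e01":
            raise NotImplementedError("parse E01 with libewf/pyewf")
        if ext == ".aff":
            raise NotImplementedError("parse AFF with AFFLIB")
        raise ValueError("unrecognized evidence format: " + path)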

2-With network packets the situation is better, with pcap being the universal format.
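
For illustration, the classic pcap global header is simple enough to read directly; this sketch handles only the original little-endian microsecond format, not pcapng or big-endian captures:

    # pcap_peek.py -- decode the 24-byte classic pcap global header.
    import struct

    def pcap_info(path):
        with open(path, "rb") as f:
            hdr = f.read(24)
        magic, vmaj, vmin, _tz, _sig, snaplen, linktype = struct.unpack(
            "<IHHiIII", hdr)
        if magic != 0xa1b2c3d4:
            raise ValueError("not a classic little-endian pcap file")
        return {"version": (vmaj, vmin),
                "snaplen": snaplen,
                "linktype": linktype}   # 1 = Ethernet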

Page 34: Digital forensics lessons

Famous digital forensics tools

EnCase

FTK

Nuix

Intella

PTK Forensics

Microsoft COFEE

Page 35: Digital forensics lessons

Conclusion

1-Digital Forensics is an exciting area in which to work, but it is exceedingly difficult because of the diversity of data that needs to be analyzed, the size of the data sets, and the mismatch between the technical skills of users and the difficulty of the work.

2-These problems are likely to get worse over time, and our only way to survive the coming crisis is to concentrate on the development of new techniques that leverage our advantage: the ability to collect and maintain large data sets of other people's information.

3-In building and maintaining this corpus I encountered many problems that are increasingly relevant to others in the field. This paper describes some of the lessons that I have learned in the course of my research in this area.