automated document and media exploitation

51
Automated Document and Media Exploitation Simson Garfinkel, Ph.D. Associate Professor Naval Postgraduate School http://faculty.nps.edu/slgarfin/ NPS is the Navy’s Research University. Location: Monterey, CA Campus Size: 627 acres Students: 1500 ! US Military (All 5 services) ! US Civilian (Scholarship for Service & SMART) ! Foreign Military (30 countries) ! All students are fully funded Schools: ! Business & Public Policy ! Engineering & Applied Sciences ! Operational & Information Sciences ! International Graduate Studies 2

Upload: lamngoc

Post on 05-Jan-2017

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Automated Document and Media Exploitation

Automated Document and Media Exploitation

Simson Garfinkel, Ph.D.Associate ProfessorNaval Postgraduate Schoolhttp://faculty.nps.edu/slgarfin/

NPS is the Navy’s Research University.

Location: Monterey, CA

Campus Size: 627 acres

Students: 1500! US Military (All 5 services)

! US Civilian (Scholarship for Service & SMART)

! Foreign Military (30 countries)

! All students are fully funded

Schools:

! Business & Public Policy

! Engineering & Applied Sciences

! Operational & Information Sciences

! International Graduate Studies

2

Page 2: Automated Document and Media Exploitation

DEEP Lab(Digital Evaluation and ExPloitation)

Current Projects:

! High-speed AES implementation on CellBE SPUs

—6 cores at a time on Sony PS3!

! User-level ATA command generator for Linux.

—For testing ATA/SATA write-blockers

! Corpora for Forensic Research

—Real Data Corpus (2000 “images” of hard drives, SD cards, etc.)

—Realistic Corpus (constructed disk images)

! Sector Discrimination Project

—What can you say about a 4096-byte sector/cluster from a hard drive?

! A piece of a known file?

! A JPEG / HTML / ZIP / DOC / DOCX?

! Encrypted?

3

Automated Document and Media Exploitation (DOMEX):

The Need

4

Page 3: Automated Document and Media Exploitation

Law enforcement & military actives encounter lots of digital documents and media.

5

!"!#$% & % '( ) ' * & +,-% & . $ ' '

& " ' . /, ' !. .(-",$ '0(-1)(-2, ' '

1 345 ' 2("%6 # . & "5 ' 76",'89:';<<;' 3!5,',%=;'

!

!

"#$!%$&'()*$+)!,-..$(,!.(/*!'+!"#$%&'($)&!*(+$#!,&-.(,/&-!+$#$0&+&#)!"#1,$-),(/)(,&0!!

"#$($!1,!213$,&($'3!&$(4$&)1/+5!$,&$41'667!'*/+8!*1+/(1)1$,5!)#')!9:!&('4)14$,!6'4;!

)('+,&'($+470!!"#1,!($,-6),!1+!'))/(+$7,!&$(4$1<1+8!)#')!&('4)14$,!'($!-+.'1(0!!"#$!%$&'()*$+)!3/$,!

+/)!$*&#',1=$!4'($$(!3$<$6/&*$+)5!'+3!)//6,!./(!&$(./(*'+4$!'&&('1,'6!'($!3$.141$+)0!!>,!'!

($,-6)5!'))/(+$7,!41)$!&//(!?&$/&6$!*'+'8$*$+)@!A7!,-&$(<1,/(,0!

!

2&/)".#!3*"&1-!$,&!$#!&4),&+&56!/,")"/$5!&5&+&#)!/.!)#$!%$&'()*$+)B,!31<$(,1)7!461*')$0!!"#$7!

#'<$!,18+1.14'+)!'-)#/(1)7!1+!($4(-1)*$+)5!#1(1+85!&(/*/)1/+5!&$(./(*'+4$!'&&('1,'65!4',$!

',,18+*$+)5!'+3!4'($$(!3$<$6/&*$+)0!!"#$!C$4)1/+!D#1$.!2/(;./(4$!1,!+/)!31<$(,$!'+3!)-(+/<$(!1,!

6/20!!"#1,!&'))$(+5!4/*A1+$3!21)#!)#$!8$+$('667!6/2!'))$+)1/+!)#')!)#$,$!*'+'8$(,!&'7!)/!,)'..!

4'($$(!3$<$6/&*$+)5!6$'3,!*1+/(1)1$,!)/!&$(4$1<$!'!6'4;!/.!'3<'+4$*$+)!/&&/()-+1)1$,0!

!

"#$!%$&'()*$+)B,!'))/(+$7!2/(;./(4$!1,!+.,&!%"7&,-&!)*$#!)*&!8929!5&0$5!:.,;1.,/&E!!FGH!

.$*'6$5!4/*&'($3!)/!FIH!1+!)#$!J0C0!6$8'6!6'A/(!&//65!'+3!KLH!*1+/(1)75!4/*&'($3!)/!KMH!1+!

)#$!6'A/(!&//60!!"#$!%$&'()*$+)B,!'))/(+$7!2/(;./(4$!1,!'A/-)!$-!%"7&,-&!$-!)*&!1&%&,$5!

0.7&,#+&#)!5&0$5!:.,;1.,/&5!2#/,$!'))/(+$7,!'($!FGH!.$*'6$!'+3!KNH!*1+/(1)70!

!

<","#0!"-!-&,7"#0!).!+$;&!)*&!=&>$,)+&#)!&7&#!+.,&!%"7&,-&E!!#1($,!1+!MIIK!2$($!OIH!.$*'6$!

'+3!MKH!*1+/(1)70!!P+!&'()14-6'(5!)*&!?)).,#&6!@&#&,$5A-!<.#.,-!B,.0,$+!"-!$#!"+>.,)$#)!

)..5!./(!1+4($',1+8!31<$(,1)70!!9/+/(,!Q(/8('*!#1($,!1+!MIIK!2$($!NFH!.$*'6$5!4/*&'($3!)/!OLH!

/.!)#$!6'2!,4#//6!8('3-')1+8!46',,5!'+3!FIH!*1+/(1)75!4/*&'($3!)/!MKH!/.!)#$!46',,!/.!MIIK0!

!

R1+/(1)1$,!'($!-"0#"1"/$#)56!(#%&,C,&>,&-&#)&%!"#!+$#$0&+&#)!,$#;-0!!"#$7!4/*&(1,$!/+67!

SH!/.!T4'($$(U!CVC!'))/(+$7,!'+3!KKH!/.!,-&$(<1,/(7!>,,1,)'+)!J0C0!>))/(+$7,0!!W/*$+!

4/+,)1)-)$!FKH!/.!CVC,!'+3!FSH!/.!,-&$(<1,/(7!>JC>,0!!>*/+8!XCYKL!'))/(+$7,!1+!)#$!

Z1)18')1+8!%1<1,1/+,5!*1+/(1)1$,!4/*&(1,$!KKH!/.!+/+Y,-&$(<1,/(,!'+3!NH!/.!,-&$(<1,/(,5!'+3!

2/*$+!4/*&(1,$!FSH!/.!+/+Y,-&$(<1,/(,!'+3!FFH!/.!,-&$(<1,/(,0!

!

D"#.,")"&-!$,&!-(E-)$#)"$556!+.,&!5";&56!).!5&$7&!)*&!=&>$,)+&#)!)*$#!:*")&-0!!P+!MIIK5!)#$!

'))(1)1/+!(')$!2',!O[H!#18#$(!'*/+8!*1+/(1)1$,!)#'+!2#1)$,0!!"#$($!2',!+/!31..$($+4$!1+!($4$+)!

'))(1)1/+!A$)2$$+!*$+!'+3!2/*$+0!

!

"#$($!'($!'6,/!,)')1,)14'667!-"0#"1"/$#)!,$/&!$#%F.,!0&#%&,!&11&/)-!/+!'!+-*A$(!/.!9:!/-)4/*$,5!

1+46-31+8!,)'()1+8!8('3$5!4-(($+)!8('3$5!&(/*/)1/+,5!'+3!4/*&$+,')1/+0!!\/(!$]'*&6$5!)#$!'<$('8$!

*1+/(1)7!XC!'))/(+$7!1,!4-(($+)67!I0O!,)$&,!6/2$(!)#'+!)#$!'<$('8$!2#1)$5!'+3!)#$!'<$('8$!2/*'+!

1,!I0F!,)$&,!6/2$(!)#'+!)#$!'<$('8$!*'+5!4/+)(/661+8!./(!,$+1/(1)75!8('3$5!'+3!4/*&/+$+)0!

!

^',$3!/+!)#$,$!.1+31+8,5!2$!($4/**$+3!)#')!)#$!%$&'()*$+)!)';$!)#$!./66/21+8!'4)1/+,E!

!

G4&,/"-&!?@C!$#%!=?@C5&7&5!5&$%&,-*">!)/!,)($,,!)#$!1*&/()'+4$!/.!31<$(,1)7!'+3!)#$1(!

4/**1)*$+)!)/!1)0!!Q-A61467!4/**1)!)#$!%$&'()*$+)!)/!&'(1)7!A/)#!1+!31<$(,1)7!/-)4/*$,!T$0805!

4/*&'('A6$!($&($,$+)')1/+!')!'66!6$<$6,U!'+3!1+!'))1)-3$,!T$0805!_/A!,')1,.'4)1/+U!'*/+8!'66!

3$*/8('&#14!8(/-&,0!!P3$+)1.7!6$<$(,!./(!4#'+8$5!./4-,1+8!/+!>>X,!T2#/!'($!31<$(,$U!'+3!

C$4)1/+!D#1$.,0!!P*&6$*$+)!)('1+1+8!/.!6$'3$(,!)/!13$+)1.7!)#$1(!(/6$!1+!,#'&1+8!2/(;!461*')$!

1,,-$,!'+3!1+!$..$4)-')1+8!4#'+8$0!

!

!

June 2007

S M T W T F S

1 2

3 4 5 6 7 8 9

10 11 12 13 14 15 16

17 18 19 20 21 22 23

24 25 26 27 28 29 30

Most of this data is analyzed using trained personnel and off-the-shelf software.

DOMEX in Iraq

6

Page 4: Automated Document and Media Exploitation

Software is mostly COTS GUI tools designed for law enforcement.

- Designed for visibility & search.

- Does not scale to 100s or 1000s of drives.

EnCase by Guidance Software

7

Manual analysis misses opportunities for correlation.

Different analysts see different hard drives.

Keyword searches don’t connect the dots.

8

email from [email protected] to [email protected]

[email protected]

Page 5: Automated Document and Media Exploitation

Tools designed for Law Enforcement do notuse data once the investigation is over.

Bad guy goes to prison. Hard drive goes to storage.

9

"Analysts in a tent" do not scale.

10

Page 6: Automated Document and Media Exploitation

We are building tools for performing Automated DOMEX.

Unrestricted Research Corpus

! Several thousand disk & device images

New techniques & algorithms

! Designed to exploit a data rich environment.

! Designed for autonomous operation.

Legal approaches for working with data! “Realistic Data” — For tool testing & education.

! “Redistributable Real Data” —!real but unrestricted.

! “Real Data” — For advanced research.

! IRB (Institutional Review Board) approaches and protocols.

11

We are building…... an unrestricted research corpus.

Hard drives, USB sticks, Digital Cameras & cell phones:

! Purchased outside the US on the secondary market.

! Available to qualified researchers.

Multiple Corpora:

! Non-US Persons Corpus

! US Persons Corpus (at Harvard)

! “Realistic” Corpus (No PII)

12

Page 7: Automated Document and Media Exploitation

We are developing…... automated DOMEX algorithms.

Ascription:

! Batch analysis of drive contents

! Automated determination of owner & source

! Attribution of carved data

Hot Drive Identification:

! Automated triage without using keywords

! “Datasphere”—aware

Cross-Drive Analysis:! Reconstruction of social networks

.

0

200

10, 000

20, 000

30, 000

40, 000Unique CCNs

Total CCNs

Drive #215182 CCNS1356 unique

Drive #17231348 CCNS11609 unique

Drive #171346 CCNS81 unique

Supermarket

ATM

StateSecretary'sOffice

MedicalCenter

AutoDealership

SoftwareVendor

Drives #74 x #77

25 CCNS

in common

Drives #171 & #172

13 CCNS

in common

Drives #179 & #206

13 CCNS

in common

Same Community College

SameMedical Center

SameCar Dealership

13

End-to-End automated analysis can increase exploitation capabilities and connect the dots.

14

Datasphere Repository

Who? Contacts?

Anything unusual?

What was done? Anything Encrypted?

Page 8: Automated Document and Media Exploitation

The Real Data Corpus

15

The Real Data Corpus: "Real Data from Real People."

Most forensic work is based on “realistic” data created in a lab.

We get real data from CN, IN, IL, MX, and other countries.

Real data provides:! Real-world experience with data management problems.

! Unpredictable OS, software, & content

! Unanticipated faults

We have multiple corpora:

! Non-US Persons Corpus

! US Persons Corpus (@Harvard)

! Releasable Real Corpus

! Realistic Corpus

16

Page 9: Automated Document and Media Exploitation

We image physical devices to disk image files.

17

image.aff

100MB - 40GB

We image physical devices to disk image files.

17

image.aff

100MB - 40GB

Page 10: Automated Document and Media Exploitation

The images are stored in the Advanced Forensic Format (AFF)

AFF extensible schema stores:

! Data copied from disk

! Metadata (SN, date of image)

! Chain-of-custody information

AFF supports:

! Compression

! Public Key Encryption & Digital Signatures

! Incremental file transfer

AFF Adoption:

! Sleuth Kit

! Black Bag (Mac Forensics)

! Others have expressed interest

AFF is designed for automating media exploitation

Advanced Forensic Format AFF ®

18

AFF Data Schema

AFF XML

raw

AFF binary

S3ewf

VM

DK

To use the Real Data corpus, you need IRB approval

US Law requires approval of an Institutional Review Board to work with human subject data [45 CFR 46]

! Clearly describe what you want to do.

! Get submit your protocol to your local IRB.

! Provide your protocol and your approval.

! Approval must be renewed every year.

No approval required for “Realistic” data.

! http://www.digitalcorpora.org/

19

Page 11: Automated Document and Media Exploitation

Automated Exploitation:Disk Images to XML

20

There are many forensic tools.

Many are programmable

! EnCase — EScript

! PyFlag — Flash Script & Python

! Sleuth Kit — C/C++

But writing programs for these systems is hard:! Programming languages are procedural and mechanism-oriented

! Data is separated from actions on the data.

! Many of the programs are not designed for easy automation.

21

XML makes forensic analysis easier.

Page 12: Automated Document and Media Exploitation

XML is ideally suited for forensic analysis

Forensic data is tree-structured.

! Case > Devices > Partitions > Directories > Files

! Files

—file system metadata

—file meta data

—file content

! Container Files (ZIP, tar, CAB)

—We can exactly represent the container structure

—PyFlag does this with “virtual files”

—No easy way to do this with the current TSK/EnCase/FTK structure

—(Note: Container files not currently implemented.)

22

fiwalk extracts disk images into XML.

XML usage:$ fiwalk [-Xfile.xml] [-x] imagefile

! -X file.xml produces an XML file with a full DTD

! -x sends XML to standard out

Other options:

! -c config.txt —!metadata extraction plug-ins

! -C nn — only process nn files

! -S — single threaded

! -z — don’t calculate MD5 or SHA1

! -s dir — save all files in dir

! -Afile.arff — creates an ARFF file with output

! -Bfile — bloomfilter support

23

AFF

<XML>

Page 13: Automated Document and Media Exploitation

fiwalk produces three kinds of XML tags.

Per-Image tags <fiwalk> — outer tag

<fiwalk_version>0.4</fiwalk_version>

<Start_time>Mon Oct 13 19:12:09 2008</Start_time>

<Imagefile>dosfs.dmg</Imagefile>

<volume startsector=”512”>

Per <volume> tags:<Partition_Offset>512</Partition_Offset>

<block_size>512</block_size>

<ftype>4</ftype>

<ftype_str>fat16</ftype_str>

<block_count>81982</block_count>

Per <fileobject> tags:<filesize>4096</filesize>

<partition>1</partition>

<filename>linedash.gif</filename>

<libmagic>GIF image data, version 89a, 410 x 143</libmagic>

24

fiwalk XML example

<fileobject>

<filename>WINDOWS/system32/config/systemprofile/「开始」菜!/程序/附件/_rf55.tmp</

filename>

<filesize>1391</filesize>

<unalloc>1</unalloc>

<used>1</used>

<mtime>1150873922</mtime>

<ctime>1160927826</ctime>

<atime>1160884800</atime>

<fragments>0</fragments>

<md5>d41d8cd98f00b204e9800998ecf8427e</md5>

<sha1>da39a3ee5e6b4b0d3255bfef95601890afd80709</sha1>

<partition>1</partition>

<byte_runs type=’resident’>

<run file_offset='0' len='65536'

fs_offset='871588864' img_offset='871621120'/>

<run file_offset='65536' len='25920'

fs_offset='871748608' img_offset='871780864'/>

</byte_runs>

</fileobject>

25

Page 14: Automated Document and Media Exploitation

The <byte_runs> array specifies thephysical location on the disk.

One or more <run> elements may be present:

<byte_runs type=’resident’>

<run file_offset='0' len='65536'

fs_offset='871588864' img_offset='871621120'/>

<run file_offset='65536' len='25920'

fs_offset='871748608' img_offset='871780864'/>

</byte_runs>

This file has two fragments:! 64K starting at sector 1702385 (871621120 ÷ 512)

! 25,920 bytes starting at sector 1702697 (871780864 ÷ 512)

Additional XML attributes may specify compression or encryption.

26

fiwalk has a plugable metadata extraction system

Metadata extractors are specified in the configuration file*.jpg dgi ../plugins/jpeg_extract

*.pdf dgi java -classpath plugins.jar Libextract_plugin

—Currently the extractor is chosen by the file extension

—fiwalk runs the plugins in a different process

—We have specs for a native JVM interface which uses IPC and 1 process.

Metadata extractors produce name:value pairs on STDOUT

Manufacturer: SONY

Model: CYBERSHOT

Orientation: top - left

fiwalk incorporates into XML:<fileobject>

...

<Manufacturer>SONY</Manufacturer>

<Model>CYBERSHOT</Model>

<Orientation>top - left</Orientation>

...

</fileobject>

27

Page 15: Automated Document and Media Exploitation

fiwalk.py: a Python module for automated forensics.

Key Features:

! Automatically runs fiwalk with correct options if given a disk image

! Reads XML file if present (faster than regenerating)

! Creates fileobject objects.

Multiple interfaces:

! SAX callback interfacefiwalk_using_sax(imagefile, xmlfile, flags, callback)

—Very fast and minimal memory footprint

! SAX procedural interfaceobjs = fileobjects_using_sax(imagefile, xmlfile, flags)

—Reasonably fast; returns a list of all file objects with XML in dictionary

! DOM procedural interface(doc,objs) = fileobjects_using_dom(imagefile, xmlfile, flags)

—Allows modification of XML that’s returned.

28

The SAX and DOM interfaces both return fileobjects!

Our Python fileobject class is an easy-to-use abstract class for working with file system data.

fileobject_sax(fileobject)! — for the SAX interface

fileobject_dom(fileobject)! – for the DOM interface

Both classes support the same interface.—fi.partition()

—fi.filename(), fi.ext()

—fi.filesize()

—fi.ctime(), fi.atime(), fi.crtime(), fi.mtime()

—fi.sha1(), fi.md5()

—fi.byteruns(), fi.fragments()

29

Page 16: Automated Document and Media Exploitation

Example: calculate average file size on a disk

Using DOM interface:import fiwalk

objs = fileobjects_using_sax(imagefile, xmlfile, flags)

print "average file size: ",sum([fi.filesize() for fi in objs]) / len(objs)

Or, for the Python-impaired:import fiwalk

objs = fileobjects_using_sax(imagefile, xmlfile, flags)

sum_of_sizes = 0

for fi in objs:

sum_of_sizes += fi.filesize()

print "average file size: ",sum_of_sizes / len(objs)

30

Example: Find and print all the files 15-bytes in length.

Using DOM interface:import fiwalk

objs = fileobjects_using_sax(imagefile, xmlfile, flags)

for fi in filter(lambda x:x.filesize()==15, objs):

print fi

Or, for the Python-impaired:import fiwalk

objs = fileobjects_using_sax(imagefile, xmlfile, flags)

for fi in objs:

if fi.filesize()==15:

print fi

31

Page 17: Automated Document and Media Exploitation

The fileobject class allows direct access to file data.

byteruns() is an array of “runs.”<byte_runs type=’resident’>

<run file_offset='0' len='65536'

fs_offset='871588864' img_offset='871621120'/>

<run file_offset='65536' len='25920'

fs_offset='871748608' img_offset='871780864'/>

</byte_runs>

Becomes:[[0, 65536, 871621120],

[65536, 25920, 871780864]]

Each run has:—run[0] — Start byte offset within the file

—run[1] — Number of bytes

—run[2] — Start offset from the beginning of the disk

32

The fileobject class allows direct access to file data.

byteruns() returns that array of “runs” for both the DOM and SAX-based file objects.

>>> print fi.byteruns()

[[0, 65536, 871621120],

[65536, 25920, 871780864]]

Accessor Methods:! fi.contents_for_run(run) — Returns the bytes from the linked disk image

! fi.contents() — Returns all of the contents

! fi.file_present(imagefile=None) — Validates MD5/SHA1 to see if image has file

! fi.tempfile(calMD5,calcSHA1) — Creates a tempfile, optionally calculating hash

33

Page 18: Automated Document and Media Exploitation

We are building several interconnected applications with this framework.

imap.py

! reads a disk image or XML file and prints a “map” of a disk image.

igroundtruth.py

! reads multiple disk images (different generations of the same disk)

! uses earlier images as “maps” for later images.

! Outputs new XML file

iverify.py

! Reads an image file and XML file.

! Reports which files in the XML file are actually resident in the image.

iredact.py

! reads a disk image (or XML file) and a “redaction file”

! Produces new disk image.

34

The redaction language is flexible.

Language: {CONDITION} {ACTION}

Conditions:

! FILENAME filename

! FILEPAT file*name

! DIRNAME dirname/

! MD5 d41d8cd98f00b204e9800998ecf8427e

! SHA1 da39a3ee5e6b4b0d3255bfef95601890afd80709

! FILE CONTAINS [email protected]

! SECTOR CONTAINS [email protected]

Actions:

! FILL 0x44

! ENCRYPT

! FUZZ (changes instructions but not strings)

35

Page 19: Automated Document and Media Exploitation

We have also built a USB transfer kiosk.

The kiosk:

! Reads a USB drive using fiwalk & fiwalk.py

! Displays list of files in GUI

! Transfers selected files to “quarantine” without mounting the disk image.

! Virus scans

! Transfers scanned files to SMB server without mounting the file server.

Key features:

! Functionality could not be implemented without forensic tools

! fiwalk & fiwalk.py allows forensics to be abstracted away

! Kiosk program is mostly GUI, not forensics

—filelist.py — 110 lines

—kiosk.py — 368 lines

—loginpanel.py — 70 lines

—smb.py — 90 lines

—watcher.py — 152 lines

36

Publishing XML for disk images enables our remote exploitation methodology...

37

Extract metadata in Boston.

Search from Monterey.

Just download what you need.

Page 20: Automated Document and Media Exploitation

Publishing XML for disk images enables our remote exploitation methodology...

37

AFF

Extract metadata in Boston.

Search from Monterey.

Just download what you need.

Publishing XML for disk images enables our remote exploitation methodology...

37

AFF

XML

Extract metadata in Boston.

Search from Monterey.

Just download what you need.

Page 21: Automated Document and Media Exploitation

Publishing XML for disk images enables our remote exploitation methodology...

37

AFF

XML

http

Extract metadata in Boston.

Search from Monterey.

Just download what you need.

Publishing XML for disk images enables our remote exploitation methodology...

37

AFF

XML

http

xmlrpc

Extract metadata in Boston.

Search from Monterey.

Just download what you need.

Page 22: Automated Document and Media Exploitation

Publishing XML for disk images enables our remote exploitation methodology...

37

AFF

XML

http

xmlrpc

Extract metadata in Boston.

Search from Monterey.

Just download what you need.

Realistic Images

38

Page 23: Automated Document and Media Exploitation

IMAGE 1 - CANON2 (FAT32)

IMAGE 2 - NTFS1 (NTFS)

IMAGE 3 - UBNIST1 (FAT32)

IMAGE 4 - UBNIST1 (embedded EXT3)

IMAGE 5 - DOMEXUSERS (NTFS)

IMAGE 6 - HFSNIST1 (HFS+)

Each image has:

! Narrative of how the image was created and expected uses.

! Image file in RAW/SPLITRAW, AFF and E01 formats

! SHA1 of raw image

! “Ground truth” report

39

We have created six disk images.

IMAGE 1 - nps-2009-canon2 (FAT32)

size filename SHA1

=============================================================================

31129600 nps-2008-canon2-gen1.raw 67364b0894a0465d6ada8c4966b6bbcaf7039082

31129600 nps-2008-canon2-gen2.raw 0e3cdef3b1a7d3762f9704bfd4349033fe808eda

31129600 nps-2008-canon2-gen3.raw 7dc8be7f3993c37f101c0ed0fec4274abccacf3c

31129600 nps-2008-canon2-gen4.raw ed1c7dea94096ad309b32037cb6d43a291952d8d

31129600 nps-2008-canon2-gen5.raw 63e7f9daf8dbcd1744579e579f3f0fddebe2ee90

31129600 nps-2008-canon2-gen6.raw 4742c325f10583dab1eb4c55d0d45ab3beb99eb3

“Disk” is a 32MB SD card shot in a Canon camera.

All operations carried out by camera:! Disk formatting (-o51 !)

! JPEG creation

! JPEG deletion

Disk was repeatedly removed from camera & imaged.

JPEGs were created and deleted in such a way to assure:! Fragmented files

! Files that can only be recovered through file carving (name overwritten, not data.)

40

Page 24: Automated Document and Media Exploitation

nps-2009-canon2 is boring...

$ fls -rp -o51 nps-2008-canon2-gen6.raw

d/d 4:! DCIM

d/d 517:! DCIM/100CANON

r/r 1029:! DCIM/100CANON/IMG_0044.JPG

r/r 1030:! DCIM/100CANON/IMG_0042.JPG

r/r 1031:! DCIM/100CANON/IMG_0003.JPG

r/r 1032:! DCIM/100CANON/IMG_0043.JPG

r/r 1033:! DCIM/100CANON/IMG_0045.JPG

r/r 1034:! DCIM/100CANON/IMG_0046.JPG

r/r 1035:! DCIM/100CANON/IMG_0007.JPG

r/r 1036:! DCIM/100CANON/IMG_0047.JPG

r/r 1037:! DCIM/100CANON/IMG_0009.JPG

r/r 1038:! DCIM/100CANON/IMG_0038.JPG

r/r 1039:! DCIM/100CANON/IMG_0011.JPG

...

41

IMG_0011.JPG:

... perhaps not so boring

It’s hard to avoid placing information in images...

... fortunately nothing here is a problem.

42

Page 25: Automated Document and Media Exploitation

IMAGE 2 - nps-2009-ntfs1 An NTFS file system created with WinXP SP3

277228 Dec 31 18:27 ntfs1-gen0.aff

8481452 Dec 31 18:27 ntfs1-gen1.aff

35551648 Jan 6 16:27 ntfs1-gen2.aff

Three directories:d/d 28-144-8:! Compressed

d/d 29-144-10:! Encrypted

d/d 27-144-7:! RAW

EFS key information:r/r 46-128-1:! EFS-key-info.txt

r/r 43-128-4:! EFS-key-no-password.pfx

r/r 45-128-4:! EFS-key-password-strong-protection.pfx

r/r 44-128-4:! EFS-key-password.pfx

The same files are in each directory:r/r 42-128-1:! RAW/20076517123273.pdf

r/r 47-128-0:! RAW/logfile1.txt

r/r 36-128-1:! RAW/NISTSP800-88_rev1.pdf

r/r 33-128-1:! RAW/NIST_logo.jpg

r/r 39-128-1:! RAW/report02-3.pdf

No existing program will automatically identify the key and then try it on the encrypted files!

43

The “logfile.txt” file was written a line-at-a-time to each directory and is fragmented…

<fileobject>

<filename>RAW/logfile1.txt</filename>

<filesize>21888890</filesize>

<partition>1</partition>

<ALLOC>1</ALLOC>

<USED>1</USED>

<mtime>1231192883</mtime>

<ctime>1231192883</ctime>

<atime>1231192883</atime>

<crtime>1231192820</crtime>

<libmagic>ASCII text, with CRLF line terminators</libmagic>

<byte_runs type=’resident’>

<run fs_offset='237428736' img_offset='237428736'

file_offset='0' len='1024'/>

<run fs_offset='243657728' img_offset='243657728'

file_offset='1024' len='3072'/>

<run fs_offset='240057344' img_offset='240057344'

file_offset='4096' len='5120'/> …

1628 fragments in all!

This is a realistic model for the writing of log files.

44

Page 26: Automated Document and Media Exploitation

IMAGE 3 - nps-2009-ubnist1 (FAT32)

UBNIST1 is a bootable USB flash drive running Ubuntu 8.10.

The “outer” file system is FAT32:$ fls -o63 ubnist1-2009-01-07.aff

r/r 4:! ldlinux.sys

d/d 6:! casper

d/d 8:! dists

d/d 10:! install

r/r 12:! syslinux.cfg

d/d 14:! pics

d/d 16:! pool

d/d 18:! preseed

d/d 20:! .disk

r/r 22:! autorun.inf

r/r 24:! md5sum.txt

r/r 27:! README.diskdefines

r/r 29:! umenu.exe

r/r 31:! wubi.exe

d/d 33:! syslinux

r/r 35:! casper-rw

45

IMAGE 4 - 2009-nps-casper-rw (embedded EXT3)

Inside UBNIST1 is an ext3 file system:

$ icat -o63 ubnist1.gen2.aff 35 > ubnist1.gen2.casper-rw

$ fls ubnist1.gen2.casper-rw

d/d 11:! lost+found

r/r 12:! .wh..wh.aufs

d/d 7681:! .wh..wh.plnk

d/d 23041:! .wh..wh..tmp

d/d 7682:! rofs

d/d 23042:! etc

d/d 23044:! cdrom

d/d 7683:! var

d/d 15361:! home

d/d 30721:! tmp

d/d 30722:! lib

d/d 15377:! usr

d/d 7712:! sbin

d/d 13:! root

r/r * 20(realloc):! .aufs.xino

d/d 38401:! $OrphanFiles

$

46

Page 27: Automated Document and Media Exploitation

IMAGE 5 - nps-2009-domexusers1 (NTFS)

Image created by LT Paul Farrell as part of his master’s thesis.

$ mmls realistic.aff

DOS Partition Table

Offset Sector: 0

Units are in 512-byte sectors

Slot Start End Length Description

00: Meta 0000000000 0000000000 0000000001 Primary Table (#0)

01: ----- 0000000000 0000000062 0000000063 Unallocated

02: 00:00 0000000063 0083859299 0083859237 NTFS (0x07)

03: ----- 0083859300 0083886079 0000026780 Unallocated

$

Contents: NTFS Windows XP installation.

Two users: domex1 and domex2

Email, Chat, limited web browsing.

47

IMAGE5 - nps-2009-domexusers1Users

$ fls -o63 realistic.aff 3524-144-6

d/d 10219-144-6:! Administrator

d/d 3526-144-6:! All Users

d/d 3525-144-7:! Default User

d/d 27708-144-5:! domex1

d/d 28463-144-5:! domex2

d/d 10146-144-6:! LocalService

d/d 3370-144-6:! NetworkService

$

There is a lot of work to do with this image before it is released:

! Redact Microsoft © data.

! Verify that image contains no PII or other copyrighted data.

! Detailed characterization report.

48

Page 28: Automated Document and Media Exploitation

IMAGE 6 - nps-2009-hfsjtest1 (HFS+)

This is a very simple image:

! Two files — file1.txt and file2.txt:

File1 has two sets of contents

! “This is file 1 - snarf”

! “New file 1 contents - snarf”

The first set of contents is in the HFS journal.

49

Automated Ascription of Multi-User Data

50

Page 29: Automated Document and Media Exploitation

Alice, Bob, Carol and Dave all share a computer:

File carving reveals "contraband information."

Who put the data on the disk?

51

Problem: How do you ascribe carved data to a particular person?

For example:

! Alice always misspells "kill" as "kilz"

! Everything Bob writes is boring

52

Prior work has used content analysis datamining to determining authorship

Page 30: Automated Document and Media Exploitation

For example:

! Alice always misspells "kill" as "kilz"

! Everything Bob writes is boring

52

Prior work has used content analysis datamining to determining authorship

For example:

! Alice always misspells "kill" as "kilz"

! Everything Bob writes is boring

52

Prior work has used content analysis datamining to determining authorship

Page 31: Automated Document and Media Exploitation

This research uses metadata to infer ownership or agency — who is responsible for the data.

Available sector metadata (“extrinsic”):

! Fragmentation patterns (disk usage)

! Where the file is on the hard drive (sector numbers)

Available file metadata (“intrinsic”):! When the file was created (internal timestamps)

! Make & model of digital cameras

! Usage patterns.

53

Let's look at the hard drive:

Whose data is this?

54

Contraband Information

Page 32: Automated Document and Media Exploitation

Alice's information is all in /Documents and Settings/Alice

All Alice's data is green.

All data is 3 blocks long.

55

Bob's information is in /Documents and Settings/Bob

Bob's information is all monochrome.

It's all labeled"Bob"

It's kind of boring.

56

Bob

Bob

Bob

Bob

Page 33: Automated Document and Media Exploitation

Carol's information is in /Documents and Settings/Carol

Carol's information is all JPEGs

It's all little dots

57

Dave's information is all in/Documents and Settings/Dave

It's all jumbled.

But a lot of the data starts two blocks in….

Some of thedata starts10 blocks in.

58

Page 34: Automated Document and Media Exploitation

So who is responsible for the contraband data?

59

Contraband Information

Carved data has more in common with Dave's data and violates others' rules.

!starts two blocks over.

!It's not a JPEG.

!It's not green.

!It’s not monochrome

Contraband Information

Bob

Bob

Bob

Bob

Dave is responsible for the carved data.

60

Page 35: Automated Document and Media Exploitation

We are developing a system for ascribing carved information on multi-user systems.

61

We are developing a system for ascribing carved information on multi-user systems.

Inputs 1:

! One or more disk images

! A rule for ascribing the exemplars.Here the rule was /Documents and Settings/username/*

! Plug-in metadata extractors

61

Page 36: Automated Document and Media Exploitation

We are developing a system for ascribing carved information on multi-user systems.

Inputs 1:

! One or more disk images

! A rule for ascribing the exemplars.Here the rule was /Documents and Settings/username/*

! Plug-in metadata extractors

Outputs 1:

! A rule-based system for ascribing files

! The accuracy of the system using N-fold validation

61

We are developing a system for ascribing carved information on multi-user systems.

Inputs 1:

! One or more disk images

! A rule for ascribing the exemplars.Here the rule was /Documents and Settings/username/*

! Plug-in metadata extractors

Outputs 1:

! A rule-based system for ascribing files

! The accuracy of the system using N-fold validation

Input 2:

! Carved data

61

Page 37: Automated Document and Media Exploitation

We are developing a system for ascribing carved information on multi-user systems.

Inputs 1:

! One or more disk images

! A rule for ascribing the exemplars.Here the rule was /Documents and Settings/username/*

! Plug-in metadata extractors

Outputs 1:

! A rule-based system for ascribing files

! The accuracy of the system using N-fold validation

Input 2:

! Carved data

Output 2:

! Ascribed individual & confidence of match

61

Key points of data ascription

The accuracy rates we have seen are phenomenal:

! 98.5% on laboratory data

! 99% - 100% on real data.

! Mostly based on sector number & embedded timestamps

! Classification Algorithm: J48 / C4.5

This technique relies upon exemplars to train the machine learning algorithm.

! You need a multi-user system;

! It could be applied to USB memory sticks, if each one identifies to a different user.

62

Page 38: Automated Document and Media Exploitation

Cross-Drive Analysis

63

Cross-Drive Analysis: Making use of the datasphere

The datasphere is information:

! on devices taken from people-of-interest

! on remote systems used by people-of-interest

! on devices and systems co-located in time and space with these systems.

Examples:

! All of the laptops, desktops and servers in the Washington DC area.

Hypothesis 1: Datasphere sampling can build better single-drive analysis tools.

Hypothesis 2: Datasphere correlation can identify and characterize previously unknown social networks.

64

Page 39: Automated Document and Media Exploitation

Network Determination: The Big Idea

1. Media is collected in theater

65

Network Determination: The Big Idea

1. Media is collected in theater

2. Some drives belong to IED-planting cells

65

Page 40: Automated Document and Media Exploitation

Network Determination: The Big Idea

1. Media is collected in theater

2. Some drives belong to IED-planting cells

3. Automatically identify links

65

Network Determination: The Big Idea

1. Media is collected in theater

2. Some drives belong to IED-planting cells

3. Automatically identify links

4. Identify new network members.

65

Page 41: Automated Document and Media Exploitation

Network Determination: The Big Idea

1. Media is collected in theater

2. Some drives belong to IED-planting cells

3. Automatically identify links

?

5. New drive arrives

4. Identify new network members.

65

Network Determination: The Big Idea

1. Media is collected in theater

2. Some drives belong to IED-planting cells

3. Automatically identify links

?

5. New drive arrives

4. Identify new network members.

6. Find links with existing drives

65

Page 42: Automated Document and Media Exploitation

Network Determination: The Big Idea

1. Media is collected in theater

2. Some drives belong to IED-planting cells

3. Automatically identify links

?

5. New drive arrives

4. Identify new network members.

6. Find links with existing drives

7. Determine network membership

65

Network Determination: The Big Idea

1. Media is collected in theater

2. Some drives belong to IED-planting cells

3. Automatically identify links

?

5. New drive arrives

4. Identify new network members.

6. Find links with existing drives

7. Determine network membership

“Datasphere"

65

Page 43: Automated Document and Media Exploitation

We efficiently process the datasphere by separating metadata, identities, and content.

66

We efficiently process the datasphere by separating metadata, identities, and content.

66

file names, email messages, etc.

Page 44: Automated Document and Media Exploitation

We efficiently process the datasphere by separating metadata, identities, and content.

66

file names, email messages, etc.

human names

photographs

camera serial numbers

ABDUL MARTIN

AERO CONTINENT

PATINO ORTIZ

Data space and feature extraction tools

Two tools for data space and feature extraction:

Fiwalk (File and inode Walk):

1. Ingests disk images (raw, EnCase, or AFF format)

2. Automatically extracts metadata (EXIFs, Microsoft Office Doc Properties)

3. Outputs in ASCII, XML, CSV, or ARFF format

“Bulk Extractor”

1. Processes any disk, even damaged/incomplete

2. Scans for email addresses.

3. Outputs list of email addresses & count (sorted by frequency)

67

Page 45: Automated Document and Media Exploitation

Hot Drive Detection:Automatically finding “the good stuff.”

We were interested in showing that sensitive information had been left behind on hard drives.

We wrote a Credit Card Number detector.

Bulk Data

CCN Detector

3713 107885 95004

@ offset 554,553,664

4388 5750 2212 4449

@ offset 43,332,334,554

5543 3343 1644 3345

@ offset 6,334,877,809

...

68

CCN Detector: written in flex and C++

Sample output:'CHASE NA|5422-4128-3008-3685| pos=13152133'DISCOVER|6011-0052-8056-4504| pos=13152440.'GE CARD|4055-9000-0378-1959| pos=13152589BANK ONE |4332-2213-0038-0832| pos=13152740.'NORWEST|4829-0000-4102-9233| pos=13153182'SNB CARD|5419-7213-0101-3624| pos=13153332

Test # pass

pattern 3857

Known prefixes 90

CCV1 43

patterns & histogram 38

Disk #105:

69

Page 46: Automated Document and Media Exploitation

Histogram analysis:importance on multiple drives is k/(drive count)

Experiment: 429 Drives

Source Countries: China and Mexico

Automatically Constructed Stop List:

Email Address Drive Frequency

[email protected] 121

[email protected] 115

[email protected] 115

[email protected] 105

[email protected] 98

[email protected] 95

[email protected] 95

[email protected] 94

[email protected] 93

[email protected] 83

[email protected] 79

[email protected] 78

[email protected] 78

72

4 “Subject” Drives withheld from ingest

Questions: Can we find which networks they associate with?

Can we learn why they associate?

Results: Drive X5-22 correlates with X5-25 and X5-28

DrivePair

Correlation Coefficient(linear scale)

X5-22 and X5-25 9.25

X5-22 and X5-28 9.13

X5-22 and X5-27 6.86

X5-22 and X5-26 5.95

X5-22 and X5-03 3.43

X5-22 and X2-09 2.20

X5-22 and X3-07 2.10

X5-22 and X3-21 2.08

X5-22

X5-28

X5-25

X5-03

X5-27

X3-21

X2-09

X5-26

X3-07Most drive pairs have CC << 0.1

73

Page 47: Automated Document and Media Exploitation

TF/IDF: Those most important email addresses are those seen on fewest # of drives

Email Address (Sanitized) # Correlation Contribution

[email protected] 2 0.5

[email protected] 2 0.5

[email protected] 2 0.5

[email protected] 2 0.5

[email protected] 2 0.5

[email protected] 2 0.5

[email protected] 3 0.3

[email protected] 3 0.3

[email protected] 3 0.3

[email protected] 3 0.3

[email protected] 4 0.25

[email protected] 4 0.25

[email protected] 4 0.25

[email protected] 4 0.25

[email protected] 4 0.25

[email protected] 5 0.2

[email protected] 5 0.2

[email protected] 5 0.2

[email protected] 5 0.2

[email protected] 5 0.2

TOTAL 9.25Most common domain reveals organization of previous owner

X5-25X5-22

[email protected]

[email protected]

[email protected]

[email protected]

X3-07

X3-21

74

Network Membership Tool

Cross Drive Analysis (with email addresses)

1. Rank all email addresses by drive frequency

2. Top 1/3 most popular email addresses are placed on a stop list.

3. Rate email addresses by inverse drive frequency

• More common email addresses are less important

Network Determination Tool

1. Extract email addresses from Subject drive

2. Compute score for each email address with other drives in datasphere

3. Report highest-scoring drive matches

• Show email addresses most responsible for drive score.

75

Page 48: Automated Document and Media Exploitation

Needed: Visualization Approaches

This visualization plots C(d1,d2) for all drives:

76

Drives #74 x #77

25 CCNS

in common

Drives #171 & #172

13 CCNS

in common

Drives #179 & #206

13 CCNS

in common

Same Community College

SameMedical Center

SameCar Dealership

Use encrypted bloom filters for secure field deployment.

Encrypted Bloom filter tool:

! Background BF created from non-Network drives

! In-Network drives are processed with tool:

—Background features removed

—Remaining features produce in-Network BF

! Captured drives are filtered against in-Network BF:

—Presence of features indicates network membership.

Advantages:

! Searches all drives simultaneously

! Bloom filter cannot be reverse-engineered by enemy

!“Practical Applications of Bloom filters to the NIST RDS and hard drive triage,” Farrell, Garfinkel & White, ACSAC 2008

Bloom

present/not present

77

Page 49: Automated Document and Media Exploitation

Different targets will have different behavioral profiles.

Behavioral templates can help identify the target.

Future work: searching the datasphere with behavior profiles and templates.

78

Future work: Clustering algorithms to identify "unknown unknowns."

79

sleuthkit.org is the official web site for The Sleuth Kit and Autopsy Browser. Both are open source digital investigation tools

sleuthkit.org is the official web site for The Sleuth Kit and Autopsy Browser. Both are open source digital investigation tools

sleuthkit.org is the official web site for The Sleuth Kit and Autopsy Browser. Both are open source digital investigation tools

sleuthkit.org is the official web site for The Sleuth Kit and Autopsy Browser. Both are open source digital investigation tools

Page 50: Automated Document and Media Exploitation

Conclusions

80

Research Collaborations

Data-Sharing Partnerships:

! DC3

! MITRE

! I.D.E.A.L. Technology Corporation (STRIKE)

! Professor Patrick Wolfe, Harvard University, NSF EXP program

! Professor Lorrie Cranor, Carnegie Mellon University

Research Partnerships:

! FBI Silicon Valley Regional Computer Forensic Lab

! Police Department, San Louis Obispo, California

81

Page 51: Automated Document and Media Exploitation

In Summary

Automated exploitation systems can be made more powerful by making them aware of the "Datasphere."

The key is to use every piece of available data—don't stop with the obvious content.

Datamining algorithms can be profitably applied to forensic problems.

82

Contraband Information

Bob

Bob

Bob

Bob

Automated

DOMEX

Database

flower.jpg

my_presentation.ppt

memo.doc

sleuthkit.org is the official web site for The Sleuth Kit and Autopsy Browser. Both are open source digital investigation tools

email addresses

credit card numbers

timestamps

hash codes

d41d8cd98f00b204e9800998ecf8427e

Datasphere Repository

Questions?