automated document and media exploitation
TRANSCRIPT
Automated Document and Media Exploitation
Simson Garfinkel, Ph.D.Associate ProfessorNaval Postgraduate Schoolhttp://faculty.nps.edu/slgarfin/
NPS is the Navy’s Research University.
Location: Monterey, CA
Campus Size: 627 acres
Students: 1500! US Military (All 5 services)
! US Civilian (Scholarship for Service & SMART)
! Foreign Military (30 countries)
! All students are fully funded
Schools:
! Business & Public Policy
! Engineering & Applied Sciences
! Operational & Information Sciences
! International Graduate Studies
2
DEEP Lab(Digital Evaluation and ExPloitation)
Current Projects:
! High-speed AES implementation on CellBE SPUs
—6 cores at a time on Sony PS3!
! User-level ATA command generator for Linux.
—For testing ATA/SATA write-blockers
! Corpora for Forensic Research
—Real Data Corpus (2000 “images” of hard drives, SD cards, etc.)
—Realistic Corpus (constructed disk images)
! Sector Discrimination Project
—What can you say about a 4096-byte sector/cluster from a hard drive?
! A piece of a known file?
! A JPEG / HTML / ZIP / DOC / DOCX?
! Encrypted?
3
Automated Document and Media Exploitation (DOMEX):
The Need
4
Law enforcement & military actives encounter lots of digital documents and media.
5
!"!#$% & % '( ) ' * & +,-% & . $ ' '
& " ' . /, ' !. .(-",$ '0(-1)(-2, ' '
1 345 ' 2("%6 # . & "5 ' 76",'89:';<<;' 3!5,',%=;'
!
!
"#$!%$&'()*$+)!,-..$(,!.(/*!'+!"#$%&'($)&!*(+$#!,&-.(,/&-!+$#$0&+&#)!"#1,$-),(/)(,&0!!
"#$($!1,!213$,&($'3!&$(4$&)1/+5!$,&$41'667!'*/+8!*1+/(1)1$,5!)#')!9:!&('4)14$,!6'4;!
)('+,&'($+470!!"#1,!($,-6),!1+!'))/(+$7,!&$(4$1<1+8!)#')!&('4)14$,!'($!-+.'1(0!!"#$!%$&'()*$+)!3/$,!
+/)!$*&#',1=$!4'($$(!3$<$6/&*$+)5!'+3!)//6,!./(!&$(./(*'+4$!'&&('1,'6!'($!3$.141$+)0!!>,!'!
($,-6)5!'))/(+$7,!41)$!&//(!?&$/&6$!*'+'8$*$+)@!A7!,-&$(<1,/(,0!
!
2&/)".#!3*"&1-!$,&!$#!&4),&+&56!/,")"/$5!&5&+&#)!/.!)#$!%$&'()*$+)B,!31<$(,1)7!461*')$0!!"#$7!
#'<$!,18+1.14'+)!'-)#/(1)7!1+!($4(-1)*$+)5!#1(1+85!&(/*/)1/+5!&$(./(*'+4$!'&&('1,'65!4',$!
',,18+*$+)5!'+3!4'($$(!3$<$6/&*$+)0!!"#$!C$4)1/+!D#1$.!2/(;./(4$!1,!+/)!31<$(,$!'+3!)-(+/<$(!1,!
6/20!!"#1,!&'))$(+5!4/*A1+$3!21)#!)#$!8$+$('667!6/2!'))$+)1/+!)#')!)#$,$!*'+'8$(,!&'7!)/!,)'..!
4'($$(!3$<$6/&*$+)5!6$'3,!*1+/(1)1$,!)/!&$(4$1<$!'!6'4;!/.!'3<'+4$*$+)!/&&/()-+1)1$,0!
!
"#$!%$&'()*$+)B,!'))/(+$7!2/(;./(4$!1,!+.,&!%"7&,-&!)*$#!)*&!8929!5&0$5!:.,;1.,/&E!!FGH!
.$*'6$5!4/*&'($3!)/!FIH!1+!)#$!J0C0!6$8'6!6'A/(!&//65!'+3!KLH!*1+/(1)75!4/*&'($3!)/!KMH!1+!
)#$!6'A/(!&//60!!"#$!%$&'()*$+)B,!'))/(+$7!2/(;./(4$!1,!'A/-)!$-!%"7&,-&!$-!)*&!1&%&,$5!
0.7&,#+&#)!5&0$5!:.,;1.,/&5!2#/,$!'))/(+$7,!'($!FGH!.$*'6$!'+3!KNH!*1+/(1)70!
!
<","#0!"-!-&,7"#0!).!+$;&!)*&!=&>$,)+&#)!&7&#!+.,&!%"7&,-&E!!#1($,!1+!MIIK!2$($!OIH!.$*'6$!
'+3!MKH!*1+/(1)70!!P+!&'()14-6'(5!)*&!?)).,#&6!@&#&,$5A-!<.#.,-!B,.0,$+!"-!$#!"+>.,)$#)!
)..5!./(!1+4($',1+8!31<$(,1)70!!9/+/(,!Q(/8('*!#1($,!1+!MIIK!2$($!NFH!.$*'6$5!4/*&'($3!)/!OLH!
/.!)#$!6'2!,4#//6!8('3-')1+8!46',,5!'+3!FIH!*1+/(1)75!4/*&'($3!)/!MKH!/.!)#$!46',,!/.!MIIK0!
!
R1+/(1)1$,!'($!-"0#"1"/$#)56!(#%&,C,&>,&-&#)&%!"#!+$#$0&+&#)!,$#;-0!!"#$7!4/*&(1,$!/+67!
SH!/.!T4'($$(U!CVC!'))/(+$7,!'+3!KKH!/.!,-&$(<1,/(7!>,,1,)'+)!J0C0!>))/(+$7,0!!W/*$+!
4/+,)1)-)$!FKH!/.!CVC,!'+3!FSH!/.!,-&$(<1,/(7!>JC>,0!!>*/+8!XCYKL!'))/(+$7,!1+!)#$!
Z1)18')1+8!%1<1,1/+,5!*1+/(1)1$,!4/*&(1,$!KKH!/.!+/+Y,-&$(<1,/(,!'+3!NH!/.!,-&$(<1,/(,5!'+3!
2/*$+!4/*&(1,$!FSH!/.!+/+Y,-&$(<1,/(,!'+3!FFH!/.!,-&$(<1,/(,0!
!
D"#.,")"&-!$,&!-(E-)$#)"$556!+.,&!5";&56!).!5&$7&!)*&!=&>$,)+&#)!)*$#!:*")&-0!!P+!MIIK5!)#$!
'))(1)1/+!(')$!2',!O[H!#18#$(!'*/+8!*1+/(1)1$,!)#'+!2#1)$,0!!"#$($!2',!+/!31..$($+4$!1+!($4$+)!
'))(1)1/+!A$)2$$+!*$+!'+3!2/*$+0!
!
"#$($!'($!'6,/!,)')1,)14'667!-"0#"1"/$#)!,$/&!$#%F.,!0&#%&,!&11&/)-!/+!'!+-*A$(!/.!9:!/-)4/*$,5!
1+46-31+8!,)'()1+8!8('3$5!4-(($+)!8('3$5!&(/*/)1/+,5!'+3!4/*&$+,')1/+0!!\/(!$]'*&6$5!)#$!'<$('8$!
*1+/(1)7!XC!'))/(+$7!1,!4-(($+)67!I0O!,)$&,!6/2$(!)#'+!)#$!'<$('8$!2#1)$5!'+3!)#$!'<$('8$!2/*'+!
1,!I0F!,)$&,!6/2$(!)#'+!)#$!'<$('8$!*'+5!4/+)(/661+8!./(!,$+1/(1)75!8('3$5!'+3!4/*&/+$+)0!
!
^',$3!/+!)#$,$!.1+31+8,5!2$!($4/**$+3!)#')!)#$!%$&'()*$+)!)';$!)#$!./66/21+8!'4)1/+,E!
!
G4&,/"-&!?@C!$#%!=?@C5&7&5!5&$%&,-*">!)/!,)($,,!)#$!1*&/()'+4$!/.!31<$(,1)7!'+3!)#$1(!
4/**1)*$+)!)/!1)0!!Q-A61467!4/**1)!)#$!%$&'()*$+)!)/!&'(1)7!A/)#!1+!31<$(,1)7!/-)4/*$,!T$0805!
4/*&'('A6$!($&($,$+)')1/+!')!'66!6$<$6,U!'+3!1+!'))1)-3$,!T$0805!_/A!,')1,.'4)1/+U!'*/+8!'66!
3$*/8('!8(/-&,0!!P3$+)1.7!6$<$(,!./(!4#'+8$5!./4-,1+8!/+!>>X,!T2#/!'($!31<$(,$U!'+3!
C$4)1/+!D#1$.,0!!P*&6$*$+)!)('1+1+8!/.!6$'3$(,!)/!13$+)1.7!)#$1(!(/6$!1+!,#'&1+8!2/(;!461*')$!
1,,-$,!'+3!1+!$..$4)-')1+8!4#'+8$0!
!
!
June 2007
S M T W T F S
1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
Most of this data is analyzed using trained personnel and off-the-shelf software.
DOMEX in Iraq
6
Software is mostly COTS GUI tools designed for law enforcement.
- Designed for visibility & search.
- Does not scale to 100s or 1000s of drives.
EnCase by Guidance Software
7
Manual analysis misses opportunities for correlation.
Different analysts see different hard drives.
Keyword searches don’t connect the dots.
8
email from [email protected] to [email protected]
Tools designed for Law Enforcement do notuse data once the investigation is over.
Bad guy goes to prison. Hard drive goes to storage.
9
"Analysts in a tent" do not scale.
10
We are building tools for performing Automated DOMEX.
Unrestricted Research Corpus
! Several thousand disk & device images
New techniques & algorithms
! Designed to exploit a data rich environment.
! Designed for autonomous operation.
Legal approaches for working with data! “Realistic Data” — For tool testing & education.
! “Redistributable Real Data” —!real but unrestricted.
! “Real Data” — For advanced research.
! IRB (Institutional Review Board) approaches and protocols.
11
We are building…... an unrestricted research corpus.
Hard drives, USB sticks, Digital Cameras & cell phones:
! Purchased outside the US on the secondary market.
! Available to qualified researchers.
Multiple Corpora:
! Non-US Persons Corpus
! US Persons Corpus (at Harvard)
! “Realistic” Corpus (No PII)
12
We are developing…... automated DOMEX algorithms.
Ascription:
! Batch analysis of drive contents
! Automated determination of owner & source
! Attribution of carved data
Hot Drive Identification:
! Automated triage without using keywords
! “Datasphere”—aware
Cross-Drive Analysis:! Reconstruction of social networks
.
0
200
10, 000
20, 000
30, 000
40, 000Unique CCNs
Total CCNs
Drive #215182 CCNS1356 unique
Drive #17231348 CCNS11609 unique
Drive #171346 CCNS81 unique
Supermarket
ATM
StateSecretary'sOffice
MedicalCenter
AutoDealership
SoftwareVendor
Drives #74 x #77
25 CCNS
in common
Drives #171 & #172
13 CCNS
in common
Drives #179 & #206
13 CCNS
in common
Same Community College
SameMedical Center
SameCar Dealership
13
End-to-End automated analysis can increase exploitation capabilities and connect the dots.
14
Datasphere Repository
Who? Contacts?
Anything unusual?
What was done? Anything Encrypted?
The Real Data Corpus
15
The Real Data Corpus: "Real Data from Real People."
Most forensic work is based on “realistic” data created in a lab.
We get real data from CN, IN, IL, MX, and other countries.
Real data provides:! Real-world experience with data management problems.
! Unpredictable OS, software, & content
! Unanticipated faults
We have multiple corpora:
! Non-US Persons Corpus
! US Persons Corpus (@Harvard)
! Releasable Real Corpus
! Realistic Corpus
16
We image physical devices to disk image files.
17
image.aff
100MB - 40GB
We image physical devices to disk image files.
17
image.aff
100MB - 40GB
The images are stored in the Advanced Forensic Format (AFF)
AFF extensible schema stores:
! Data copied from disk
! Metadata (SN, date of image)
! Chain-of-custody information
AFF supports:
! Compression
! Public Key Encryption & Digital Signatures
! Incremental file transfer
AFF Adoption:
! Sleuth Kit
! Black Bag (Mac Forensics)
! Others have expressed interest
AFF is designed for automating media exploitation
Advanced Forensic Format AFF ®
18
AFF Data Schema
AFF XML
raw
AFF binary
S3ewf
VM
DK
To use the Real Data corpus, you need IRB approval
US Law requires approval of an Institutional Review Board to work with human subject data [45 CFR 46]
! Clearly describe what you want to do.
! Get submit your protocol to your local IRB.
! Provide your protocol and your approval.
! Approval must be renewed every year.
No approval required for “Realistic” data.
! http://www.digitalcorpora.org/
19
Automated Exploitation:Disk Images to XML
20
There are many forensic tools.
Many are programmable
! EnCase — EScript
! PyFlag — Flash Script & Python
! Sleuth Kit — C/C++
But writing programs for these systems is hard:! Programming languages are procedural and mechanism-oriented
! Data is separated from actions on the data.
! Many of the programs are not designed for easy automation.
21
XML makes forensic analysis easier.
XML is ideally suited for forensic analysis
Forensic data is tree-structured.
! Case > Devices > Partitions > Directories > Files
! Files
—file system metadata
—file meta data
—file content
! Container Files (ZIP, tar, CAB)
—We can exactly represent the container structure
—PyFlag does this with “virtual files”
—No easy way to do this with the current TSK/EnCase/FTK structure
—(Note: Container files not currently implemented.)
22
fiwalk extracts disk images into XML.
XML usage:$ fiwalk [-Xfile.xml] [-x] imagefile
! -X file.xml produces an XML file with a full DTD
! -x sends XML to standard out
Other options:
! -c config.txt —!metadata extraction plug-ins
! -C nn — only process nn files
! -S — single threaded
! -z — don’t calculate MD5 or SHA1
! -s dir — save all files in dir
! -Afile.arff — creates an ARFF file with output
! -Bfile — bloomfilter support
23
AFF
<XML>
fiwalk produces three kinds of XML tags.
Per-Image tags <fiwalk> — outer tag
<fiwalk_version>0.4</fiwalk_version>
<Start_time>Mon Oct 13 19:12:09 2008</Start_time>
<Imagefile>dosfs.dmg</Imagefile>
<volume startsector=”512”>
Per <volume> tags:<Partition_Offset>512</Partition_Offset>
<block_size>512</block_size>
<ftype>4</ftype>
<ftype_str>fat16</ftype_str>
<block_count>81982</block_count>
Per <fileobject> tags:<filesize>4096</filesize>
<partition>1</partition>
<filename>linedash.gif</filename>
<libmagic>GIF image data, version 89a, 410 x 143</libmagic>
24
fiwalk XML example
<fileobject>
<filename>WINDOWS/system32/config/systemprofile/「开始」菜!/程序/附件/_rf55.tmp</
filename>
<filesize>1391</filesize>
<unalloc>1</unalloc>
<used>1</used>
<mtime>1150873922</mtime>
<ctime>1160927826</ctime>
<atime>1160884800</atime>
<fragments>0</fragments>
<md5>d41d8cd98f00b204e9800998ecf8427e</md5>
<sha1>da39a3ee5e6b4b0d3255bfef95601890afd80709</sha1>
<partition>1</partition>
<byte_runs type=’resident’>
<run file_offset='0' len='65536'
fs_offset='871588864' img_offset='871621120'/>
<run file_offset='65536' len='25920'
fs_offset='871748608' img_offset='871780864'/>
</byte_runs>
</fileobject>
25
The <byte_runs> array specifies thephysical location on the disk.
One or more <run> elements may be present:
<byte_runs type=’resident’>
<run file_offset='0' len='65536'
fs_offset='871588864' img_offset='871621120'/>
<run file_offset='65536' len='25920'
fs_offset='871748608' img_offset='871780864'/>
</byte_runs>
This file has two fragments:! 64K starting at sector 1702385 (871621120 ÷ 512)
! 25,920 bytes starting at sector 1702697 (871780864 ÷ 512)
Additional XML attributes may specify compression or encryption.
26
fiwalk has a plugable metadata extraction system
Metadata extractors are specified in the configuration file*.jpg dgi ../plugins/jpeg_extract
*.pdf dgi java -classpath plugins.jar Libextract_plugin
—Currently the extractor is chosen by the file extension
—fiwalk runs the plugins in a different process
—We have specs for a native JVM interface which uses IPC and 1 process.
Metadata extractors produce name:value pairs on STDOUT
Manufacturer: SONY
Model: CYBERSHOT
Orientation: top - left
fiwalk incorporates into XML:<fileobject>
...
<Manufacturer>SONY</Manufacturer>
<Model>CYBERSHOT</Model>
<Orientation>top - left</Orientation>
...
</fileobject>
27
fiwalk.py: a Python module for automated forensics.
Key Features:
! Automatically runs fiwalk with correct options if given a disk image
! Reads XML file if present (faster than regenerating)
! Creates fileobject objects.
Multiple interfaces:
! SAX callback interfacefiwalk_using_sax(imagefile, xmlfile, flags, callback)
—Very fast and minimal memory footprint
! SAX procedural interfaceobjs = fileobjects_using_sax(imagefile, xmlfile, flags)
—Reasonably fast; returns a list of all file objects with XML in dictionary
! DOM procedural interface(doc,objs) = fileobjects_using_dom(imagefile, xmlfile, flags)
—Allows modification of XML that’s returned.
28
The SAX and DOM interfaces both return fileobjects!
Our Python fileobject class is an easy-to-use abstract class for working with file system data.
fileobject_sax(fileobject)! — for the SAX interface
fileobject_dom(fileobject)! – for the DOM interface
Both classes support the same interface.—fi.partition()
—fi.filename(), fi.ext()
—fi.filesize()
—fi.ctime(), fi.atime(), fi.crtime(), fi.mtime()
—fi.sha1(), fi.md5()
—fi.byteruns(), fi.fragments()
29
Example: calculate average file size on a disk
Using DOM interface:import fiwalk
objs = fileobjects_using_sax(imagefile, xmlfile, flags)
print "average file size: ",sum([fi.filesize() for fi in objs]) / len(objs)
Or, for the Python-impaired:import fiwalk
objs = fileobjects_using_sax(imagefile, xmlfile, flags)
sum_of_sizes = 0
for fi in objs:
sum_of_sizes += fi.filesize()
print "average file size: ",sum_of_sizes / len(objs)
30
Example: Find and print all the files 15-bytes in length.
Using DOM interface:import fiwalk
objs = fileobjects_using_sax(imagefile, xmlfile, flags)
for fi in filter(lambda x:x.filesize()==15, objs):
print fi
Or, for the Python-impaired:import fiwalk
objs = fileobjects_using_sax(imagefile, xmlfile, flags)
for fi in objs:
if fi.filesize()==15:
print fi
31
The fileobject class allows direct access to file data.
byteruns() is an array of “runs.”<byte_runs type=’resident’>
<run file_offset='0' len='65536'
fs_offset='871588864' img_offset='871621120'/>
<run file_offset='65536' len='25920'
fs_offset='871748608' img_offset='871780864'/>
</byte_runs>
Becomes:[[0, 65536, 871621120],
[65536, 25920, 871780864]]
Each run has:—run[0] — Start byte offset within the file
—run[1] — Number of bytes
—run[2] — Start offset from the beginning of the disk
32
The fileobject class allows direct access to file data.
byteruns() returns that array of “runs” for both the DOM and SAX-based file objects.
>>> print fi.byteruns()
[[0, 65536, 871621120],
[65536, 25920, 871780864]]
Accessor Methods:! fi.contents_for_run(run) — Returns the bytes from the linked disk image
! fi.contents() — Returns all of the contents
! fi.file_present(imagefile=None) — Validates MD5/SHA1 to see if image has file
! fi.tempfile(calMD5,calcSHA1) — Creates a tempfile, optionally calculating hash
33
We are building several interconnected applications with this framework.
imap.py
! reads a disk image or XML file and prints a “map” of a disk image.
igroundtruth.py
! reads multiple disk images (different generations of the same disk)
! uses earlier images as “maps” for later images.
! Outputs new XML file
iverify.py
! Reads an image file and XML file.
! Reports which files in the XML file are actually resident in the image.
iredact.py
! reads a disk image (or XML file) and a “redaction file”
! Produces new disk image.
34
The redaction language is flexible.
Language: {CONDITION} {ACTION}
Conditions:
! FILENAME filename
! FILEPAT file*name
! DIRNAME dirname/
! MD5 d41d8cd98f00b204e9800998ecf8427e
! SHA1 da39a3ee5e6b4b0d3255bfef95601890afd80709
! FILE CONTAINS [email protected]
! SECTOR CONTAINS [email protected]
Actions:
! FILL 0x44
! ENCRYPT
! FUZZ (changes instructions but not strings)
35
We have also built a USB transfer kiosk.
The kiosk:
! Reads a USB drive using fiwalk & fiwalk.py
! Displays list of files in GUI
! Transfers selected files to “quarantine” without mounting the disk image.
! Virus scans
! Transfers scanned files to SMB server without mounting the file server.
Key features:
! Functionality could not be implemented without forensic tools
! fiwalk & fiwalk.py allows forensics to be abstracted away
! Kiosk program is mostly GUI, not forensics
—filelist.py — 110 lines
—kiosk.py — 368 lines
—loginpanel.py — 70 lines
—smb.py — 90 lines
—watcher.py — 152 lines
36
Publishing XML for disk images enables our remote exploitation methodology...
37
Extract metadata in Boston.
Search from Monterey.
Just download what you need.
Publishing XML for disk images enables our remote exploitation methodology...
37
AFF
Extract metadata in Boston.
Search from Monterey.
Just download what you need.
Publishing XML for disk images enables our remote exploitation methodology...
37
AFF
XML
Extract metadata in Boston.
Search from Monterey.
Just download what you need.
Publishing XML for disk images enables our remote exploitation methodology...
37
AFF
XML
http
Extract metadata in Boston.
Search from Monterey.
Just download what you need.
Publishing XML for disk images enables our remote exploitation methodology...
37
AFF
XML
http
xmlrpc
Extract metadata in Boston.
Search from Monterey.
Just download what you need.
Publishing XML for disk images enables our remote exploitation methodology...
37
AFF
XML
http
xmlrpc
Extract metadata in Boston.
Search from Monterey.
Just download what you need.
Realistic Images
38
IMAGE 1 - CANON2 (FAT32)
IMAGE 2 - NTFS1 (NTFS)
IMAGE 3 - UBNIST1 (FAT32)
IMAGE 4 - UBNIST1 (embedded EXT3)
IMAGE 5 - DOMEXUSERS (NTFS)
IMAGE 6 - HFSNIST1 (HFS+)
Each image has:
! Narrative of how the image was created and expected uses.
! Image file in RAW/SPLITRAW, AFF and E01 formats
! SHA1 of raw image
! “Ground truth” report
39
We have created six disk images.
IMAGE 1 - nps-2009-canon2 (FAT32)
size filename SHA1
=============================================================================
31129600 nps-2008-canon2-gen1.raw 67364b0894a0465d6ada8c4966b6bbcaf7039082
31129600 nps-2008-canon2-gen2.raw 0e3cdef3b1a7d3762f9704bfd4349033fe808eda
31129600 nps-2008-canon2-gen3.raw 7dc8be7f3993c37f101c0ed0fec4274abccacf3c
31129600 nps-2008-canon2-gen4.raw ed1c7dea94096ad309b32037cb6d43a291952d8d
31129600 nps-2008-canon2-gen5.raw 63e7f9daf8dbcd1744579e579f3f0fddebe2ee90
31129600 nps-2008-canon2-gen6.raw 4742c325f10583dab1eb4c55d0d45ab3beb99eb3
“Disk” is a 32MB SD card shot in a Canon camera.
All operations carried out by camera:! Disk formatting (-o51 !)
! JPEG creation
! JPEG deletion
Disk was repeatedly removed from camera & imaged.
JPEGs were created and deleted in such a way to assure:! Fragmented files
! Files that can only be recovered through file carving (name overwritten, not data.)
40
nps-2009-canon2 is boring...
$ fls -rp -o51 nps-2008-canon2-gen6.raw
d/d 4:! DCIM
d/d 517:! DCIM/100CANON
r/r 1029:! DCIM/100CANON/IMG_0044.JPG
r/r 1030:! DCIM/100CANON/IMG_0042.JPG
r/r 1031:! DCIM/100CANON/IMG_0003.JPG
r/r 1032:! DCIM/100CANON/IMG_0043.JPG
r/r 1033:! DCIM/100CANON/IMG_0045.JPG
r/r 1034:! DCIM/100CANON/IMG_0046.JPG
r/r 1035:! DCIM/100CANON/IMG_0007.JPG
r/r 1036:! DCIM/100CANON/IMG_0047.JPG
r/r 1037:! DCIM/100CANON/IMG_0009.JPG
r/r 1038:! DCIM/100CANON/IMG_0038.JPG
r/r 1039:! DCIM/100CANON/IMG_0011.JPG
...
41
IMG_0011.JPG:
... perhaps not so boring
It’s hard to avoid placing information in images...
... fortunately nothing here is a problem.
42
IMAGE 2 - nps-2009-ntfs1 An NTFS file system created with WinXP SP3
277228 Dec 31 18:27 ntfs1-gen0.aff
8481452 Dec 31 18:27 ntfs1-gen1.aff
35551648 Jan 6 16:27 ntfs1-gen2.aff
Three directories:d/d 28-144-8:! Compressed
d/d 29-144-10:! Encrypted
d/d 27-144-7:! RAW
EFS key information:r/r 46-128-1:! EFS-key-info.txt
r/r 43-128-4:! EFS-key-no-password.pfx
r/r 45-128-4:! EFS-key-password-strong-protection.pfx
r/r 44-128-4:! EFS-key-password.pfx
The same files are in each directory:r/r 42-128-1:! RAW/20076517123273.pdf
r/r 47-128-0:! RAW/logfile1.txt
r/r 36-128-1:! RAW/NISTSP800-88_rev1.pdf
r/r 33-128-1:! RAW/NIST_logo.jpg
r/r 39-128-1:! RAW/report02-3.pdf
No existing program will automatically identify the key and then try it on the encrypted files!
43
The “logfile.txt” file was written a line-at-a-time to each directory and is fragmented…
<fileobject>
<filename>RAW/logfile1.txt</filename>
<filesize>21888890</filesize>
<partition>1</partition>
<ALLOC>1</ALLOC>
<USED>1</USED>
<mtime>1231192883</mtime>
<ctime>1231192883</ctime>
<atime>1231192883</atime>
<crtime>1231192820</crtime>
<libmagic>ASCII text, with CRLF line terminators</libmagic>
<byte_runs type=’resident’>
<run fs_offset='237428736' img_offset='237428736'
file_offset='0' len='1024'/>
<run fs_offset='243657728' img_offset='243657728'
file_offset='1024' len='3072'/>
<run fs_offset='240057344' img_offset='240057344'
file_offset='4096' len='5120'/> …
1628 fragments in all!
This is a realistic model for the writing of log files.
44
IMAGE 3 - nps-2009-ubnist1 (FAT32)
UBNIST1 is a bootable USB flash drive running Ubuntu 8.10.
The “outer” file system is FAT32:$ fls -o63 ubnist1-2009-01-07.aff
r/r 4:! ldlinux.sys
d/d 6:! casper
d/d 8:! dists
d/d 10:! install
r/r 12:! syslinux.cfg
d/d 14:! pics
d/d 16:! pool
d/d 18:! preseed
d/d 20:! .disk
r/r 22:! autorun.inf
r/r 24:! md5sum.txt
r/r 27:! README.diskdefines
r/r 29:! umenu.exe
r/r 31:! wubi.exe
d/d 33:! syslinux
r/r 35:! casper-rw
45
IMAGE 4 - 2009-nps-casper-rw (embedded EXT3)
Inside UBNIST1 is an ext3 file system:
$ icat -o63 ubnist1.gen2.aff 35 > ubnist1.gen2.casper-rw
$ fls ubnist1.gen2.casper-rw
d/d 11:! lost+found
r/r 12:! .wh..wh.aufs
d/d 7681:! .wh..wh.plnk
d/d 23041:! .wh..wh..tmp
d/d 7682:! rofs
d/d 23042:! etc
d/d 23044:! cdrom
d/d 7683:! var
d/d 15361:! home
d/d 30721:! tmp
d/d 30722:! lib
d/d 15377:! usr
d/d 7712:! sbin
d/d 13:! root
r/r * 20(realloc):! .aufs.xino
d/d 38401:! $OrphanFiles
$
46
IMAGE 5 - nps-2009-domexusers1 (NTFS)
Image created by LT Paul Farrell as part of his master’s thesis.
$ mmls realistic.aff
DOS Partition Table
Offset Sector: 0
Units are in 512-byte sectors
Slot Start End Length Description
00: Meta 0000000000 0000000000 0000000001 Primary Table (#0)
01: ----- 0000000000 0000000062 0000000063 Unallocated
02: 00:00 0000000063 0083859299 0083859237 NTFS (0x07)
03: ----- 0083859300 0083886079 0000026780 Unallocated
$
Contents: NTFS Windows XP installation.
Two users: domex1 and domex2
Email, Chat, limited web browsing.
47
IMAGE5 - nps-2009-domexusers1Users
$ fls -o63 realistic.aff 3524-144-6
d/d 10219-144-6:! Administrator
d/d 3526-144-6:! All Users
d/d 3525-144-7:! Default User
d/d 27708-144-5:! domex1
d/d 28463-144-5:! domex2
d/d 10146-144-6:! LocalService
d/d 3370-144-6:! NetworkService
$
There is a lot of work to do with this image before it is released:
! Redact Microsoft © data.
! Verify that image contains no PII or other copyrighted data.
! Detailed characterization report.
48
IMAGE 6 - nps-2009-hfsjtest1 (HFS+)
This is a very simple image:
! Two files — file1.txt and file2.txt:
File1 has two sets of contents
! “This is file 1 - snarf”
! “New file 1 contents - snarf”
The first set of contents is in the HFS journal.
49
Automated Ascription of Multi-User Data
50
Alice, Bob, Carol and Dave all share a computer:
File carving reveals "contraband information."
Who put the data on the disk?
51
Problem: How do you ascribe carved data to a particular person?
For example:
! Alice always misspells "kill" as "kilz"
! Everything Bob writes is boring
52
Prior work has used content analysis datamining to determining authorship
For example:
! Alice always misspells "kill" as "kilz"
! Everything Bob writes is boring
52
Prior work has used content analysis datamining to determining authorship
For example:
! Alice always misspells "kill" as "kilz"
! Everything Bob writes is boring
52
Prior work has used content analysis datamining to determining authorship
This research uses metadata to infer ownership or agency — who is responsible for the data.
Available sector metadata (“extrinsic”):
! Fragmentation patterns (disk usage)
! Where the file is on the hard drive (sector numbers)
Available file metadata (“intrinsic”):! When the file was created (internal timestamps)
! Make & model of digital cameras
! Usage patterns.
53
Let's look at the hard drive:
Whose data is this?
54
Contraband Information
Alice's information is all in /Documents and Settings/Alice
All Alice's data is green.
All data is 3 blocks long.
55
Bob's information is in /Documents and Settings/Bob
Bob's information is all monochrome.
It's all labeled"Bob"
It's kind of boring.
56
Bob
Bob
Bob
Bob
Carol's information is in /Documents and Settings/Carol
Carol's information is all JPEGs
It's all little dots
57
Dave's information is all in/Documents and Settings/Dave
It's all jumbled.
But a lot of the data starts two blocks in….
Some of thedata starts10 blocks in.
58
So who is responsible for the contraband data?
59
Contraband Information
Carved data has more in common with Dave's data and violates others' rules.
!starts two blocks over.
!It's not a JPEG.
!It's not green.
!It’s not monochrome
Contraband Information
Bob
Bob
Bob
Bob
Dave is responsible for the carved data.
60
We are developing a system for ascribing carved information on multi-user systems.
61
We are developing a system for ascribing carved information on multi-user systems.
Inputs 1:
! One or more disk images
! A rule for ascribing the exemplars.Here the rule was /Documents and Settings/username/*
! Plug-in metadata extractors
61
We are developing a system for ascribing carved information on multi-user systems.
Inputs 1:
! One or more disk images
! A rule for ascribing the exemplars.Here the rule was /Documents and Settings/username/*
! Plug-in metadata extractors
Outputs 1:
! A rule-based system for ascribing files
! The accuracy of the system using N-fold validation
61
We are developing a system for ascribing carved information on multi-user systems.
Inputs 1:
! One or more disk images
! A rule for ascribing the exemplars.Here the rule was /Documents and Settings/username/*
! Plug-in metadata extractors
Outputs 1:
! A rule-based system for ascribing files
! The accuracy of the system using N-fold validation
Input 2:
! Carved data
61
We are developing a system for ascribing carved information on multi-user systems.
Inputs 1:
! One or more disk images
! A rule for ascribing the exemplars.Here the rule was /Documents and Settings/username/*
! Plug-in metadata extractors
Outputs 1:
! A rule-based system for ascribing files
! The accuracy of the system using N-fold validation
Input 2:
! Carved data
Output 2:
! Ascribed individual & confidence of match
61
Key points of data ascription
The accuracy rates we have seen are phenomenal:
! 98.5% on laboratory data
! 99% - 100% on real data.
! Mostly based on sector number & embedded timestamps
! Classification Algorithm: J48 / C4.5
This technique relies upon exemplars to train the machine learning algorithm.
! You need a multi-user system;
! It could be applied to USB memory sticks, if each one identifies to a different user.
62
Cross-Drive Analysis
63
Cross-Drive Analysis: Making use of the datasphere
The datasphere is information:
! on devices taken from people-of-interest
! on remote systems used by people-of-interest
! on devices and systems co-located in time and space with these systems.
Examples:
! All of the laptops, desktops and servers in the Washington DC area.
Hypothesis 1: Datasphere sampling can build better single-drive analysis tools.
Hypothesis 2: Datasphere correlation can identify and characterize previously unknown social networks.
64
Network Determination: The Big Idea
1. Media is collected in theater
65
Network Determination: The Big Idea
1. Media is collected in theater
2. Some drives belong to IED-planting cells
65
Network Determination: The Big Idea
1. Media is collected in theater
2. Some drives belong to IED-planting cells
3. Automatically identify links
65
Network Determination: The Big Idea
1. Media is collected in theater
2. Some drives belong to IED-planting cells
3. Automatically identify links
4. Identify new network members.
65
Network Determination: The Big Idea
1. Media is collected in theater
2. Some drives belong to IED-planting cells
3. Automatically identify links
?
5. New drive arrives
4. Identify new network members.
65
Network Determination: The Big Idea
1. Media is collected in theater
2. Some drives belong to IED-planting cells
3. Automatically identify links
?
5. New drive arrives
4. Identify new network members.
6. Find links with existing drives
65
Network Determination: The Big Idea
1. Media is collected in theater
2. Some drives belong to IED-planting cells
3. Automatically identify links
?
5. New drive arrives
4. Identify new network members.
6. Find links with existing drives
7. Determine network membership
65
Network Determination: The Big Idea
1. Media is collected in theater
2. Some drives belong to IED-planting cells
3. Automatically identify links
?
5. New drive arrives
4. Identify new network members.
6. Find links with existing drives
7. Determine network membership
“Datasphere"
65
We efficiently process the datasphere by separating metadata, identities, and content.
66
We efficiently process the datasphere by separating metadata, identities, and content.
66
file names, email messages, etc.
We efficiently process the datasphere by separating metadata, identities, and content.
66
file names, email messages, etc.
human names
photographs
camera serial numbers
ABDUL MARTIN
AERO CONTINENT
PATINO ORTIZ
Data space and feature extraction tools
Two tools for data space and feature extraction:
Fiwalk (File and inode Walk):
1. Ingests disk images (raw, EnCase, or AFF format)
2. Automatically extracts metadata (EXIFs, Microsoft Office Doc Properties)
3. Outputs in ASCII, XML, CSV, or ARFF format
“Bulk Extractor”
1. Processes any disk, even damaged/incomplete
2. Scans for email addresses.
3. Outputs list of email addresses & count (sorted by frequency)
67
Hot Drive Detection:Automatically finding “the good stuff.”
We were interested in showing that sensitive information had been left behind on hard drives.
We wrote a Credit Card Number detector.
Bulk Data
CCN Detector
3713 107885 95004
@ offset 554,553,664
4388 5750 2212 4449
@ offset 43,332,334,554
5543 3343 1644 3345
@ offset 6,334,877,809
...
68
CCN Detector: written in flex and C++
Sample output:'CHASE NA|5422-4128-3008-3685| pos=13152133'DISCOVER|6011-0052-8056-4504| pos=13152440.'GE CARD|4055-9000-0378-1959| pos=13152589BANK ONE |4332-2213-0038-0832| pos=13152740.'NORWEST|4829-0000-4102-9233| pos=13153182'SNB CARD|5419-7213-0101-3624| pos=13153332
Test # pass
pattern 3857
Known prefixes 90
CCV1 43
patterns & histogram 38
Disk #105:
69
Histogram analysis:importance on multiple drives is k/(drive count)
Experiment: 429 Drives
Source Countries: China and Mexico
Automatically Constructed Stop List:
Email Address Drive Frequency
72
4 “Subject” Drives withheld from ingest
Questions: Can we find which networks they associate with?
Can we learn why they associate?
Results: Drive X5-22 correlates with X5-25 and X5-28
DrivePair
Correlation Coefficient(linear scale)
X5-22 and X5-25 9.25
X5-22 and X5-28 9.13
X5-22 and X5-27 6.86
X5-22 and X5-26 5.95
X5-22 and X5-03 3.43
X5-22 and X2-09 2.20
X5-22 and X3-07 2.10
X5-22 and X3-21 2.08
X5-22
X5-28
X5-25
X5-03
X5-27
X3-21
X2-09
X5-26
X3-07Most drive pairs have CC << 0.1
73
TF/IDF: Those most important email addresses are those seen on fewest # of drives
Email Address (Sanitized) # Correlation Contribution
[email protected] 2 0.5
[email protected] 2 0.5
[email protected] 2 0.5
[email protected] 2 0.5
[email protected] 2 0.5
[email protected] 2 0.5
[email protected] 3 0.3
[email protected] 3 0.3
[email protected] 3 0.3
[email protected] 3 0.3
[email protected] 4 0.25
[email protected] 4 0.25
[email protected] 4 0.25
[email protected] 4 0.25
[email protected] 4 0.25
[email protected] 5 0.2
[email protected] 5 0.2
[email protected] 5 0.2
[email protected] 5 0.2
[email protected] 5 0.2
TOTAL 9.25Most common domain reveals organization of previous owner
X5-25X5-22
X3-07
X3-21
74
Network Membership Tool
Cross Drive Analysis (with email addresses)
1. Rank all email addresses by drive frequency
2. Top 1/3 most popular email addresses are placed on a stop list.
3. Rate email addresses by inverse drive frequency
• More common email addresses are less important
Network Determination Tool
1. Extract email addresses from Subject drive
2. Compute score for each email address with other drives in datasphere
3. Report highest-scoring drive matches
• Show email addresses most responsible for drive score.
75
Needed: Visualization Approaches
This visualization plots C(d1,d2) for all drives:
76
Drives #74 x #77
25 CCNS
in common
Drives #171 & #172
13 CCNS
in common
Drives #179 & #206
13 CCNS
in common
Same Community College
SameMedical Center
SameCar Dealership
Use encrypted bloom filters for secure field deployment.
Encrypted Bloom filter tool:
! Background BF created from non-Network drives
! In-Network drives are processed with tool:
—Background features removed
—Remaining features produce in-Network BF
! Captured drives are filtered against in-Network BF:
—Presence of features indicates network membership.
Advantages:
! Searches all drives simultaneously
! Bloom filter cannot be reverse-engineered by enemy
!“Practical Applications of Bloom filters to the NIST RDS and hard drive triage,” Farrell, Garfinkel & White, ACSAC 2008
Bloom
present/not present
77
Different targets will have different behavioral profiles.
Behavioral templates can help identify the target.
Future work: searching the datasphere with behavior profiles and templates.
78
Future work: Clustering algorithms to identify "unknown unknowns."
79
sleuthkit.org is the official web site for The Sleuth Kit and Autopsy Browser. Both are open source digital investigation tools
sleuthkit.org is the official web site for The Sleuth Kit and Autopsy Browser. Both are open source digital investigation tools
sleuthkit.org is the official web site for The Sleuth Kit and Autopsy Browser. Both are open source digital investigation tools
sleuthkit.org is the official web site for The Sleuth Kit and Autopsy Browser. Both are open source digital investigation tools
Conclusions
80
Research Collaborations
Data-Sharing Partnerships:
! DC3
! MITRE
! I.D.E.A.L. Technology Corporation (STRIKE)
! Professor Patrick Wolfe, Harvard University, NSF EXP program
! Professor Lorrie Cranor, Carnegie Mellon University
Research Partnerships:
! FBI Silicon Valley Regional Computer Forensic Lab
! Police Department, San Louis Obispo, California
81
In Summary
Automated exploitation systems can be made more powerful by making them aware of the "Datasphere."
The key is to use every piece of available data—don't stop with the obvious content.
Datamining algorithms can be profitably applied to forensic problems.
82
Contraband Information
Bob
Bob
Bob
Bob
Automated
DOMEX
Database
flower.jpg
my_presentation.ppt
memo.doc
sleuthkit.org is the official web site for The Sleuth Kit and Autopsy Browser. Both are open source digital investigation tools
email addresses
credit card numbers
timestamps
hash codes
d41d8cd98f00b204e9800998ecf8427e
Datasphere Repository
Questions?