TRANSCRIPT
October 2012
Hadoop Forensics
Kevvie Fowler, GCFA Gold, CISSP, MCTS, MCDBA, MCSD, MCSE
Presented at SecTor
About me
Kevvie Fowler
Day job: Lead the TELUS Intelligent Analysis practice
Night job: Founder and principal consultant, Ringzero
Frequent speaker at major security conferences (Black Hat USA, SecTor, AppSec Asia, Microsoft, etc.)
Industry contributions:
Data, data and more data
The world uses and stores a lot of data
Increased regulatory requirements
Modernizing of industries
Disk space is cheap and people “store and forget”
Industry has experienced 40 to 50% growth in storage over the last 3 years
Several new data management technologies have been released over the past few years to help us manage the problem
Technology to the rescue
Data management landscape (SQL, NoSQL, NewSQL, etc.)
Source: blogs.the451group.com
The crime scene of today
The digital crime scene has grown
What we historically examined over a year will be dwarfed by a single big data investigation
Hadoop the popular big data solution
Business is looking at Hadoop to solve the big data problem
But wait, aren't there security issues with Hadoop?
Secure or not, investigations will lead there and will need to occur
Hadoop overview
What is this Hadoop?
An open-source framework for large-scale data processing
Moves processing to the data
Leverages MapReduce to take a query, divide it, and run it in parallel over multiple commodity nodes
Hadoop is basically an open-source MapReduce implementation
Supports structured and unstructured data
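The MapReduce model described above can be illustrated with a toy, single-process word count (plain Python, not Hadoop code): map tasks emit key/value pairs per input split, and a reduce step aggregates them.

```python
from collections import defaultdict

# Conceptual sketch of the MapReduce model Hadoop implements: a job is
# split into map tasks that run over input chunks, then a reduce step
# aggregates the intermediate key/value pairs.

def map_phase(chunk):
    # Emit (word, 1) for every word in one input split.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Sum the counts per key, as a reducer would per partition.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def word_count(chunks):
    # The shuffle step is implicit here: all intermediate pairs
    # reach a single reducer.
    intermediate = []
    for chunk in chunks:  # in Hadoop these map tasks run in parallel
        intermediate.extend(map_phase(chunk))
    return reduce_phase(intermediate)
```

In Hadoop the chunks would be HDFS blocks and the map tasks would run on the nodes that already hold the data.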
Hadoop architecture
[Architecture diagram: the Name Node (daemons, web server, Name Node logs, File System Image, FSImage files, edit logs, metadata) coordinates Data Nodes 1-3, each running a web server, Task Tracker, logging, and block storage (File 1 Block 1, File 2 Blocks 1-2, replicated across nodes); a Job Tracker and client interfaces sit in front of the cluster]
Forensics 101
The standard forensic approach | Acquire it all, ask questions later
Future science can be applied against it
Repeatable analysis
Many Hadoop clusters consist of huge data sets that can’t practically be acquired in their entirety
This session will cover Hadoop-specific data reduction techniques
Cloudera Hadoop
Hadoop version 2.0
Hadoop HDFS
Hadoop artifacts
Standard OS fare
Connections, Users, Logs (SSH, etc.)
Cluster properties
Configuration, size, servers, key settings
Recent activity
Jobs (active, historical)
Current state
Metadata
– FSImage
– Edit logs
Files
– Only the necessary ones!
Tools | Native Hadoop tools (command line, WebUI)
Hadoop artifacts | cluster properties
Cluster properties | via command line
hdfs dfsadmin -report

Results:
Size of Hadoop cluster
Number/addresses of data nodes
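As a sketch of what an examiner might do with saved output from the command above, the following Python pulls cluster size and data-node addresses out of a `hdfs dfsadmin -report` dump. The field labels in the patterns are assumptions and vary across Hadoop versions, so verify them against your own output.

```python
import re

# Hypothetical sketch: parse saved `hdfs dfsadmin -report` output for
# the number of data nodes and their addresses. Field labels such as
# "Datanodes available" and "Name:" are assumptions that should be
# checked against the report format of the Hadoop version in use.

def parse_dfsadmin_report(text):
    # Each data node section starts with a "Name: host:port" line.
    nodes = re.findall(r"^Name:\s*(\S+)", text, re.MULTILINE)
    # The summary line reports the live node count.
    m = re.search(r"Datanodes available:\s*(\d+)", text)
    return {
        "datanodes": int(m.group(1)) if m else len(nodes),
        "addresses": nodes,
    }
```

Parsing a preserved text copy keeps the analysis repeatable without re-querying the live cluster.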
Hadoop artifacts | cluster properties
Cluster properties | via WebUI <address>:50070/dfshealth.jsp
Hadoop artifacts | cluster properties
Cluster properties | via WebUI <address>:50070/dfshealth.jsp (continued)
Hadoop artifacts | cluster properties
Cluster properties | Modes
Modes
Standalone | single node & everything in one Java process
Pseudo-distributed | single node & daemons in separate Java processes
Fully-distributed | daemons run on a cluster of machines
Managed via two types of configuration files
Default (*-default.xml)
Site-specific (*-site.xml)
Hadoop artifacts | cluster properties
File (/hadoop/conf) | Description | Key properties
core-site.xml | General Hadoop properties | fs.trash.interval, fs.trash.checkpoint.interval
hdfs-site.xml | HDFS properties | dfs.replication, dfs.blocksize, dfs.permissions.enabled, hadoop.tmp.dir, dfs.namenode.name.dir, dfs.namenode.checkpoint.dir, dfs.datanode.data.dir
mapred-site.xml | MapReduce properties | mapreduce.task.tmp.dir, mapreduce.jobhistory.done-dir
log4j.properties | Logging properties | hadoop.mapred.JobTracker, hadoop.mapred.TaskTracker, namenode.FSNamesystem.audit
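Since the site files are standard Hadoop `<name>`/`<value>` XML, the key settings in the table above can be extracted programmatically; a minimal sketch, where the list of properties of interest mirrors the table:

```python
import xml.etree.ElementTree as ET

# Sketch: pull the forensically interesting properties out of a Hadoop
# site file (e.g. hdfs-site.xml or core-site.xml). The property names
# are the standard Hadoop settings listed in the table above.

KEYS_OF_INTEREST = {
    "fs.trash.interval", "dfs.replication", "dfs.blocksize",
    "dfs.namenode.name.dir", "dfs.datanode.data.dir",
}

def read_site_file(xml_text):
    # Hadoop config files are <configuration> containing <property>
    # elements, each with <name> and <value> children.
    root = ET.fromstring(xml_text)
    props = {p.findtext("name"): p.findtext("value")
             for p in root.iter("property")}
    return {k: v for k, v in props.items() if k in KEYS_OF_INTEREST}
```

Remember that a value absent from the site file falls back to the *-default.xml default, so an empty result does not mean the setting is unset.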
Hadoop artifacts | recent activity
Recent activity | Jobs
Jobs are executed from a client JAR file
Each job will have one or more sub tasks managed by the task trackers
Two files are created in conjunction with each executed job
Job Configuration XML file | contains the job configuration as specified when the job is launched
Job Status File | the counters, status, start/stop time, task attempt details, etc.
Hadoop artifacts | recent activity
Recent activity | Job history
Lists all jobs, active (yet to complete) and historical
JobID, StartTime, Executing user
JobID can be used to view the job status file which contains job execution details
mapred job -list all
Hadoop artifacts | recent activity
Recent activity | Job JAR file
Jar file
Hadoop artifacts | recent activity
Recent activity | Job history
Job configuration file (/history)
Hadoop artifacts | recent activity
Recent activity | Job history
Job status file (/history)
Artifact preservation | metadata
Metadata | FSImage file
Contains a listing of all directories/files and associated metadata
– Name
– Replication level
– Modification time (Files & directories)
– Access time
– Block size
– Permissions (directories only)
– Etc.
Two FSImage files are saved by default
Artifact preservation | metadata
Metadata| FSImage file
_nn | the last transaction (and all prior) that modified the image
Current state of Hadoop | FSImage file
Metadata| FSImage file
Recommended: Force a checkpoint prior to gathering FSImage to refresh the image file
Beware - Requires the cluster to be taken off-line!
Offline Image Viewer (OIV) can be used to gather the current FSImage file
hdfs secondarynamenode -checkpoint force
hdfs oiv -i "<file and path>" -o "<file and path>" -p Delimited -delimiter "|"
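The delimited dump produced by the command above can be loaded into records for analysis; a sketch, assuming a pipe-delimited OIV output whose column order matches the 2.x Delimited processor's defaults (verify the order against your own dump):

```python
import csv
import io

# Sketch: load a pipe-delimited Offline Image Viewer dump into
# per-file records. The column order below is an assumption based on
# the Delimited processor's defaults; check it against your output.

COLUMNS = [
    "path", "replication", "mtime", "atime", "blocksize",
    "blocks", "filesize", "nsquota", "dsquota",
    "perms", "user", "group",
]

def load_oiv_dump(text):
    reader = csv.reader(io.StringIO(text), delimiter="|")
    return [dict(zip(COLUMNS, row)) for row in reader]
```

With the dump in record form, the examiner can filter by owner, modification time, or path instead of acquiring whole data sets.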
Current state of Hadoop | FSImage file
Metadata| FSImage file
Results
Current state of Hadoop | Edit log
Metadata| Edit log
All writes to files on disk are first recorded in the edit log
1M edit operations are retained by default
Current state of Hadoop | Edit log
Metadata | Edit log (continued)
Operations within the Hadoop edit log:
▪ OP_INVALID
▪ OP_ADD
▪ OP_RENAME_OLD
▪ OP_DELETE
▪ OP_MKDIR
▪ OP_SET_REPLICATION
▪ OP_DATANODE_ADD
▪ OP_DATANODE_REMOVE
▪ OP_SET_PERMISSIONS
▪ OP_SET_OWNER
▪ OP_CLOSE
▪ OP_SET_GENSTAMP
▪ OP_SET_NS_QUOTA
▪ OP_CLEAR_NS_QUOTA
▪ OP_TIMES
▪ OP_SET_QUOTA
▪ OP_RENAME
▪ OP_CONCAT_DELETE
▪ OP_SYMLINK
▪ OP_GET_DELEGATION_TOKEN
▪ OP_RENEW_DELEGATION_TOKEN
▪ OP_CANCEL_DELEGATION_TOKEN
▪ OP_UPDATE_MASTER_KEY
▪ OP_REASSIGN_LEASE
▪ OP_END_LOG_SEGMENT
▪ OP_START_LOG_SEGMENT
▪ OP_UPDATE_BLOCKS
Current state of Hadoop | Edit log
Metadata | Edit log (continued)
hdfs oev -i <file and path> -o <file and path> -p stats
Offline Edits Viewer (OEV) with the stats processor dumps a summary of operations within the edit log
OEV can be run while the cluster is online
Current state of Hadoop | Edit log
Metadata| Edit log (continued)
Offline Edits Viewer (OEV) without the stats processor dumps operations performed after the last FSImage file
hdfs oev -i "<edit filepath>" -o "<out file>" -v
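An exported XML edits dump can be summarized by operation type, similar to what the stats processor reports; a minimal sketch, assuming the standard OEV XML layout of `RECORD` elements with `OPCODE` children:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sketch: summarize an OEV XML export by operation type. The layout
# assumed here (an EDITS root holding RECORD elements, each with an
# OPCODE child) matches the default XML processor's output.

def count_ops(xml_text):
    root = ET.fromstring(xml_text)
    return Counter(r.findtext("OPCODE") for r in root.iter("RECORD"))
```

A spike of OP_DELETE or OP_SET_OWNER operations in a narrow transaction range is the kind of signal that tells the examiner where to look next.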
Current state of Hadoop | Edit log
Metadata | Edit log (continued)
[Screenshot: an edit log entry annotated with the operation, transaction ID, file created, modification and access times, allocated block, and file permissions | example of a file copied from the local system to Hadoop]
Impersonation overview
Current state of Hadoop | File retrieval
hadoop distcp
Uses MapReduce to move large groups of files
– Parallel file transfer
-p[rbugp] Preserve
r: replication number
b: block size
u: user
g: group
p: permission
pt: last modification and last access times – planned for a future release
Fast but not the best option
Pollutes job and task history
Does not preserve timestamps
Current state of Hadoop | File retrieval (continued)
The fsck command can be used to list the blocks associated with a given file
hdfs fsck /<file> -files -blocks -racks
Current state of Hadoop | File retrieval (continued)
Navigate the local FS of the specified data node to the block location
Blocks (files) can be imaged directly from the local FS
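Before imaging block files from a data node's local file system, hashing them supports later verification of the acquired copies; a sketch, where the block names (e.g. `blk_1073741825`) come from the fsck output and the data directory comes from dfs.datanode.data.dir:

```python
import hashlib
import os

# Sketch: hash the block files identified by fsck under a data node's
# local storage directory (dfs.datanode.data.dir) before imaging, so
# acquired copies can be verified later. The block names and the
# directory layout here are illustrative, not a fixed format.

def hash_blocks(data_dir, block_names):
    digests = {}
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name in block_names:
                with open(os.path.join(root, name), "rb") as f:
                    digests[name] = hashlib.sha256(f.read()).hexdigest()
    return digests
```

Recording the digests alongside the images preserves the chain of evidence for the specific blocks acquired, without copying the rest of the cluster.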
Current state of Hadoop | Hadoop Trash
Hadoop Trash
Often enabled by organizations
When enabled, deleted files are not really deleted
Files are moved to a Trash location (/user/<username>/.Trash) where they are scheduled for deletion at a later date/time
References
www.cloudera.com
hadoop.apache.org