TRANSCRIPT
October 2012
Hadoop Forensics
Kevvie Fowler, GCFA Gold, CISSP, MCTS, MCDBA, MCSD, MCSE
Presented at SecTor
About me
Kevvie Fowler
Day job: Lead the TELUS Intelligent Analysis practice
Night job: Founder and principal consultant, Ringzero
Frequent speaker at major security conferences (Black Hat USA, SecTor, AppSec Asia, Microsoft, etc.)
Industry contributions:
Data, data and more data
The world uses and stores a lot of data
Increased regulatory requirements
Modernizing of industries
Disk space is cheap and people “store and forget”
Industry has experienced 40 to 50% growth in storage over the last 3 years
Several new data management technologies have been released over the past few years to help us manage the problem
Technology to the rescue
Data management landscape (SQL, NoSQL, NewSQL, etc.)
Source: blogs.the451group.com
The crime scene of today
The digital crime scene has grown
What we historically examined over a year will be dwarfed by a single big data investigation
Hadoop the popular big data solution
Business is looking at Hadoop to solve the big data problem
But wait, aren't there security issues with Hadoop?
Secure or not, investigations will lead there and will need to occur
Hadoop overview
What is this Hadoop?
An open-source framework for large-scale data processing
Moves processing to the data
Leverages MapReduce to take a query, divide it, and run it in parallel over multiple commodity nodes
Hadoop is basically an open-source MapReduce implementation
Supports structured and unstructured data
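The MapReduce model described above can be illustrated with a toy, single-process word count (plain Python, not Hadoop code): map tasks emit key/value pairs per input split, and a reduce step aggregates them.

```python
from collections import defaultdict

# Conceptual sketch of the MapReduce model Hadoop implements: a job is
# split into map tasks that run over input chunks, then a reduce step
# aggregates the intermediate key/value pairs.

def map_phase(chunk):
    # Emit (word, 1) for every word in one input split.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Sum the counts per key, as a reducer would per partition.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def word_count(chunks):
    # The shuffle step is implicit here: all intermediate pairs
    # reach a single reducer.
    intermediate = []
    for chunk in chunks:  # in Hadoop these map tasks run in parallel
        intermediate.extend(map_phase(chunk))
    return reduce_phase(intermediate)
```

In Hadoop the chunks would be HDFS blocks and the map tasks would run on the nodes that already hold the data.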
Hadoop architecture
[Architecture diagram: the Name Node (daemons, web server, Name Node logs, File System Image, FSImage files, edit logs, metadata) coordinates Data Nodes 1-3, each running a web server, Task Tracker, logging, and block storage (File 1 Block 1, File 2 Blocks 1-2, replicated across nodes); a Job Tracker and client interfaces sit in front of the cluster]
Forensics 101
The standard forensic approach | Acquire it all, ask questions later
Future science can be applied against it
Repeatable analysis
Many Hadoop clusters consist of huge data sets that can’t practically be acquired in their entirety
This session will cover Hadoop-specific data reduction techniques
Cloudera Hadoop
Hadoop version 2.0
Hadoop HDFS
Hadoop artifacts
Standard OS fare
Connections, Users, Logs (SSH, etc.)
Cluster properties
Configuration, size, servers, key settings
Recent activity
Jobs (active, historical)
Current state
Metadata
– FSImage
– Edit logs
Files
– Only the necessary ones!
Tools | Native Hadoop tools (command line, WebUI)
Hadoop artifacts | cluster properties
Cluster properties | via command line
hdfs dfsadmin -report

Results:
Size of Hadoop cluster
Number/addresses of data nodes
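As a sketch of what an examiner might do with saved output from the command above, the following Python pulls cluster size and data-node addresses out of a `hdfs dfsadmin -report` dump. The field labels in the patterns are assumptions and vary across Hadoop versions, so verify them against your own output.

```python
import re

# Hypothetical sketch: parse saved `hdfs dfsadmin -report` output for
# the number of data nodes and their addresses. Field labels such as
# "Datanodes available" and "Name:" are assumptions that should be
# checked against the report format of the Hadoop version in use.

def parse_dfsadmin_report(text):
    # Each data node section starts with a "Name: host:port" line.
    nodes = re.findall(r"^Name:\s*(\S+)", text, re.MULTILINE)
    # The summary line reports the live node count.
    m = re.search(r"Datanodes available:\s*(\d+)", text)
    return {
        "datanodes": int(m.group(1)) if m else len(nodes),
        "addresses": nodes,
    }
```

Parsing a preserved text copy keeps the analysis repeatable without re-querying the live cluster.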
Hadoop artifacts | cluster properties
Cluster properties | via WebUI <address>:50070/dfshealth.jsp
Hadoop artifacts | cluster properties
Cluster properties | via WebUI <address>:50070/dfshealth.jsp (continued)
Hadoop artifacts | cluster properties
Cluster properties | Modes
Modes
Standalone | single node & everything in one Java process
Pseudo-distributed | single node & daemons in separate Java processes
Fully-distributed | daemons run on a cluster of machines
Managed via two types of configuration files
Default (*-default.xml)
Site-specific (*-site.xml)
Hadoop artifacts | cluster properties
File (/hadoop/conf) | Description | Key properties
core-site.xml | General Hadoop properties | fs.trash.interval, fs.trash.checkpoint.interval
hdfs-site.xml | HDFS properties | dfs.replication, dfs.blocksize, dfs.permissions.enabled, hadoop.tmp.dir, dfs.namenode.name.dir, dfs.namenode.checkpoint.dir, dfs.datanode.data.dir
mapred-site.xml | MapReduce properties | mapreduce.task.tmp.dir, mapreduce.jobhistory.done-dir
log4j.properties | Logging properties | hadoop.mapred.JobTracker, hadoop.mapred.TaskTracker, namenode.FSNamesystem.audit
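Since the site files are standard Hadoop `<name>`/`<value>` XML, the key settings in the table above can be extracted programmatically; a minimal sketch, where the list of properties of interest mirrors the table:

```python
import xml.etree.ElementTree as ET

# Sketch: pull the forensically interesting properties out of a Hadoop
# site file (e.g. hdfs-site.xml or core-site.xml). The property names
# are the standard Hadoop settings listed in the table above.

KEYS_OF_INTEREST = {
    "fs.trash.interval", "dfs.replication", "dfs.blocksize",
    "dfs.namenode.name.dir", "dfs.datanode.data.dir",
}

def read_site_file(xml_text):
    # Hadoop config files are <configuration> containing <property>
    # elements, each with <name> and <value> children.
    root = ET.fromstring(xml_text)
    props = {p.findtext("name"): p.findtext("value")
             for p in root.iter("property")}
    return {k: v for k, v in props.items() if k in KEYS_OF_INTEREST}
```

Remember that a value absent from the site file falls back to the *-default.xml default, so an empty result does not mean the setting is unset.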
Hadoop artifacts | recent activity
Recent activity | Jobs
Jobs are executed from a client JAR file
Each job will have one or more sub tasks managed by the task trackers
Two files are created in conjunction with each executed job
Job Configuration XML file | contains the job configuration as specified when the job is launched
Job Status File | the counters, status, start/stop time, task attempt details, etc.
Hadoop artifacts | recent activity
Recent activity | Job history
Lists all jobs, active (yet to complete) and historical
JobID, StartTime, Executing user
JobID can be used to view the job status file which contains job execution details
mapred job -list all
Hadoop artifacts | recent activity
Recent activity | Job JAR file
Jar file
Hadoop artifacts | recent activity
Recent activity | Job history
Job configuration file (/history)
Hadoop artifacts | recent activity
Recent activity | Job history
Job status file (/history)
Artifact preservation | metadata
Metadata | FSImage file
Contains a listing of all directories/files and associated metadata
– Name
– Replication level
– Modification time (Files & directories)
– Access time
– Block size
– Permissions (directories only)
– Etc.
Two FSImage files are saved by default
Artifact preservation | metadata
Metadata| FSImage file
_nn | the last transaction (and all prior) that modified the image
Current state of Hadoop | FSImage file
Metadata| FSImage file
Recommended: Force a checkpoint prior to gathering FSImage to refresh the image file
Beware - Requires the cluster to be taken off-line!
Offline Image Viewer (OIV) can be used to gather the current FSImage file
hdfs secondarynamenode -checkpoint force
hdfs oiv -i "<file and path>" -o "<file and path>" -p Delimited -delimiter "|"
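The delimited dump produced by the command above can be loaded into records for analysis; a sketch, assuming a pipe-delimited OIV output whose column order matches the 2.x Delimited processor's defaults (verify the order against your own dump):

```python
import csv
import io

# Sketch: load a pipe-delimited Offline Image Viewer dump into
# per-file records. The column order below is an assumption based on
# the Delimited processor's defaults; check it against your output.

COLUMNS = [
    "path", "replication", "mtime", "atime", "blocksize",
    "blocks", "filesize", "nsquota", "dsquota",
    "perms", "user", "group",
]

def load_oiv_dump(text):
    reader = csv.reader(io.StringIO(text), delimiter="|")
    return [dict(zip(COLUMNS, row)) for row in reader]
```

With the dump in record form, the examiner can filter by owner, modification time, or path instead of acquiring whole data sets.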
Current state of Hadoop | FSImage file
Metadata| FSImage file
Results
Current state of Hadoop | Edit log
Metadata| Edit log
All writes to files on disk are first recorded in the edit log
1M edit operations are retained by default
Current state of Hadoop | Edit log
Metadata | Edit log (continued)
Operations within the Hadoop edit log:
▪ OP_INVALID
▪ OP_ADD
▪ OP_RENAME_OLD
▪ OP_DELETE
▪ OP_MKDIR
▪ OP_SET_REPLICATION
▪ OP_DATANODE_ADD
▪ OP_DATANODE_REMOVE
▪ OP_SET_PERMISSIONS
▪ OP_SET_OWNER
▪ OP_CLOSE
▪ OP_SET_GENSTAMP
▪ OP_SET_NS_QUOTA
▪ OP_CLEAR_NS_QUOTA
▪ OP_TIMES
▪ OP_SET_QUOTA
▪ OP_RENAME
▪ OP_CONCAT_DELETE
▪ OP_SYMLINK
▪ OP_GET_DELEGATION_TOKEN
▪ OP_RENEW_DELEGATION_TOKEN
▪ OP_CANCEL_DELEGATION_TOKEN
▪ OP_UPDATE_MASTER_KEY
▪ OP_REASSIGN_LEASE
▪ OP_END_LOG_SEGMENT
▪ OP_START_LOG_SEGMENT
▪ OP_UPDATE_BLOCKS
Current state of Hadoop | Edit log
Metadata | Edit log (continued)
hdfs oev -i <file and path> -o <file and path> -p stats
Offline Edits Viewer (OEV) with the stats processor dumps a summary of operations within the edit log
OEV can be run while the cluster is online
Current state of Hadoop | Edit log
Metadata| Edit log (continued)
Offline Edits Viewer (OEV) without the stats processor dumps operations performed after the last FSImage file
hdfs oev -i "<edit filepath>" -o "<out file>" -v
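An exported XML edits dump can be summarized by operation type, similar to what the stats processor reports; a minimal sketch, assuming the standard OEV XML layout of `RECORD` elements with `OPCODE` children:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sketch: summarize an OEV XML export by operation type. The layout
# assumed here (an EDITS root holding RECORD elements, each with an
# OPCODE child) matches the default XML processor's output.

def count_ops(xml_text):
    root = ET.fromstring(xml_text)
    return Counter(r.findtext("OPCODE") for r in root.iter("RECORD"))
```

A spike of OP_DELETE or OP_SET_OWNER operations in a narrow transaction range is the kind of signal that tells the examiner where to look next.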
Current state of Hadoop | Edit log
Metadata | Edit log (continued)
[Screenshot: an edit log entry annotated with the operation, transaction ID, file created, modification and access times, allocated block, and file permissions | example of a file copied from the local system to Hadoop]
Impersonation overview
Current state of Hadoop | File retrieval
hadoop distcp
Uses MapReduce to move large groups of files
– Parallel file transfer
-p[rbugp] Preserve
r: replication number
b: block size
u: user
g: group
p: permission
pt: last modification and last access times – planned for a future release
Fast but not the best option
Pollutes job and task history
Does not preserve timestamps
Current state of Hadoop | File retrieval (continued)
The fsck command can be used to list the blocks associated with a given file
hdfs fsck /<file> -files -blocks -racks
Current state of Hadoop | File retrieval (continued)
Navigate the local FS of the specified data node to the block location
Blocks (files) can be imaged directly from the local FS
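Before imaging block files from a data node's local file system, hashing them supports later verification of the acquired copies; a sketch, where the block names (e.g. `blk_1073741825`) come from the fsck output and the data directory comes from dfs.datanode.data.dir:

```python
import hashlib
import os

# Sketch: hash the block files identified by fsck under a data node's
# local storage directory (dfs.datanode.data.dir) before imaging, so
# acquired copies can be verified later. The block names and the
# directory layout here are illustrative, not a fixed format.

def hash_blocks(data_dir, block_names):
    digests = {}
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name in block_names:
                with open(os.path.join(root, name), "rb") as f:
                    digests[name] = hashlib.sha256(f.read()).hexdigest()
    return digests
```

Recording the digests alongside the images preserves the chain of evidence for the specific blocks acquired, without copying the rest of the cluster.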
Current state of Hadoop | Hadoop Trash
Hadoop Trash
Often enabled by organizations
When enabled, deleted files are not really deleted
Files are moved to a Trash location (/user/<username>/.Trash) where they are scheduled for deletion at a later date/time
References
www.cloudera.com
hadoop.apache.org