Hadoop Distributed File System

  • 1

    Hadoop Distributed File System

  • 2

    CONTENTS HDFS Introduction

    HDFS Advantages

    HDFS Architecture

    Daemons

    Basic File System Operations

    HDFS Permissions and Security

    Disadvantages

    Only for TCS Internal Training - NextGen Solutions, Kochi

  • 3

    INTRODUCTION

    HDFS, the Hadoop Distributed File System, is a distributed file system that holds large amounts of data, from terabytes up to petabytes.

    Is a distributed, scalable, portable file system.

    Is based on the design of GFS, Google File System.

    Is a block structured file system where individual files are broken into blocks of fixed size.

    Runs in a separate namespace, isolated from the contents of your local files.

    Files are stored in a redundant manner to ensure durability against failure.

    Data is written to HDFS once and read many times.

    Updates to existing files in HDFS are not supported; an extension to Hadoop provides support for appending new data to the end of files.
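    The block-structured layout described above can be sketched as follows. The 8-byte block size and sample data are illustrative assumptions; real HDFS blocks are far larger (64 MB in classic Hadoop).

```python
# Sketch of how a block-structured file system splits a file into
# fixed-size blocks. Block size here is an illustrative assumption;
# real HDFS defaults are 64 MB or 128 MB.
def split_into_blocks(data: bytes, block_size: int):
    """Break a byte string into fixed-size blocks; the last may be short."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"abcdefghijklmnopqrst", 8)
# 20 bytes at block size 8 -> blocks of 8, 8 and 4 bytes
print([len(b) for b in blocks])  # [8, 8, 4]
```

    Each of these blocks would then be stored (and replicated) on different datanodes, which is what lets a single file exceed the capacity of any one machine.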


  • 4

    ADVANTAGES

    Can hold large data sets (terabytes to petabytes) through data distribution among multiple machines.

    Highly fault tolerant.

    High throughput through parallel computing.

    Streaming access to file system data.

    Can be built out of low-cost hardware.

    Processing logic close to the data.

    Reliability by automatically maintaining multiple copies of data.


  • 5

    ARCHITECTURE


  • 6

    COMMUNICATION PROTOCOL

    All HDFS communication protocols are layered on top of the TCP/IP protocol.

    A client establishes a connection to a configurable TCP port on the namenode machine, using the Client Protocol.

    Datanodes talk to the namenode using the DataNode Protocol.

    The namenode only responds to Remote Procedure Call (RPC) requests issued by datanodes or clients; it never initiates a connection itself.


  • 7

    RACK AWARENESS

    Racks can be considered as a set of rows, where each row consists of a group of machines or nodes.

    Large hadoop clusters are arranged in racks.

    Communication between nodes in the same rack is faster (higher bandwidth) than communication between nodes spread across different racks.

    Replicas of a block are placed on multiple racks for improved fault tolerance.

    HDFS can be made rack-aware by the use of a network topology script.

    The master node uses the network topology script to map the network topology of the cluster.

    The network topology script receives the IP addresses of machines as input and returns a list of rack names, one for each input.
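    A minimal topology script of this shape might look like the following sketch. The subnet-to-rack table is a made-up example; real deployments typically derive the rack from their own addressing scheme.

```python
#!/usr/bin/env python3
# Sketch of an HDFS-style network topology script: given node IP addresses
# as command-line arguments, print one rack name per input address.
# The subnet-to-rack mapping below is an illustrative assumption.
import sys

RACKS = {
    "10.1.1": "/rack1",
    "10.1.2": "/rack2",
}

def rack_for(ip: str) -> str:
    subnet = ip.rsplit(".", 1)[0]              # e.g. "10.1.1.17" -> "10.1.1"
    return RACKS.get(subnet, "/default-rack")  # unknown hosts fall back

if __name__ == "__main__":
    for ip in sys.argv[1:]:
        print(rack_for(ip))
```

    The master invokes such a script with a batch of addresses and reads one rack name per line, which is why the output order must match the input order.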


  • 8

    BLOCK REBALANCE

    Rebalancing distributes HDFS data uniformly (equally) across the datanodes in a cluster.

    Block replicas will be spread across the racks to ensure backups in case of rack failure.

    In case of node failure, preference will be given to replicas on the same rack so that cross-rack network I/O is reduced.

    An automatic balancer tool, included as part of Hadoop, intelligently balances blocks across the nodes.

    Perfect balancing is unlikely to be achieved.

    It is best to run the balancing script when cluster utilisation is at its minimum.
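    The balancer's core idea can be sketched as follows: a node is over- or under-utilised when its usage deviates from the cluster average by more than a threshold, and blocks are moved from the former to the latter. The node names, usage figures, and 10% threshold are illustrative assumptions.

```python
# Sketch of the balancer's classification step: compare each datanode's
# disk usage (as a fraction) against the cluster average. Nodes outside
# the threshold band are candidates for block movement.
def classify_nodes(usage: dict, threshold: float = 0.10):
    """Return (over, under) lists of node names relative to the average."""
    avg = sum(usage.values()) / len(usage)
    over = [n for n, u in usage.items() if u > avg + threshold]
    under = [n for n, u in usage.items() if u < avg - threshold]
    return over, under

# Average usage is 0.50; dn1 is over-utilised, dn3 under-utilised.
over, under = classify_nodes({"dn1": 0.90, "dn2": 0.50, "dn3": 0.10})
print(over, under)  # ['dn1'] ['dn3']
```

    The real balancer also honours replica-placement rules while moving blocks, which is one reason perfect balance is rarely reached.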


  • 9

    DAEMONS

    Daemons are resident programs that together constitute a running Hadoop installation.

    Each runs as a background service on a node in the cluster.

    They include the namenode, datanode, secondary namenode, jobtracker and tasktracker.


  • 10

    NAMENODE

    Namenode is the bookkeeper of HDFS.

    Stores all the metadata (data about data) for the filesystem.

    Keeps track of how your files are broken down into file blocks.

    Checks the overall health of the distributed file system.

    HDFS cluster consists of a single namenode, a master server that manages the file system namespace and regulates access to files by clients.

    Periodically checks heartbeat messages from the datanodes to verify that they are alive.


  • 11

    DATANODE

    Datanodes are the individual machines in an HDFS cluster that store data.

    There will be multiple datanodes within an HDFS cluster.

    Multiple blocks which constitute a single file will be stored across different datanodes.

    File blocks are replicated across multiple datanodes to avoid data loss.

    HDFS follows a master-slave architecture where there will be a single namenode (Master) and multiple datanodes (Slaves).

    Serves read and write requests, and performs block creation, deletion and replication upon instruction from the namenode.

    Periodically sends heartbeat messages to the namenode to report that it is alive.
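    The heartbeat mechanism can be sketched as follows: the namenode records the time of each datanode's last heartbeat and treats a node as dead once that time is older than a timeout. The clock values are illustrative assumptions; the 600-second default roughly mirrors Hadoop's classic ten-minute liveness window.

```python
# Sketch of heartbeat-based liveness checking on the namenode side:
# a datanode whose last heartbeat is older than `timeout` seconds
# is considered dead, and its blocks get re-replicated elsewhere.
def dead_nodes(last_heartbeat: dict, now: float, timeout: float = 600.0):
    """Return datanodes whose last heartbeat is older than `timeout` seconds."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout)

# dn1 reported 30 s ago (alive); dn2 reported 930 s ago (dead).
print(dead_nodes({"dn1": 1000.0, "dn2": 100.0}, now=1030.0))  # ['dn2']
```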


  • 12

    SECONDARY NAMENODE (SNN)

    It is an assistant daemon that helps to monitor the state of the HDFS cluster.

    There will be a single secondary name node per cluster.

    Communicates with the namenode to gather information about HDFS metadata.

    Usually runs on a different machine than the primary namenode, as its memory requirements are of the same order as the primary's.

    Its copy of the HDFS metadata can be used to reconfigure the cluster in case of primary namenode failure.


  • 13

    JOBTRACKER

    A service within Hadoop that assigns tasks to specific nodes within the cluster.

    There will be a single jobtracker per cluster, and it typically runs on a master node.

    Client applications submit jobs to the jobtracker.

    Communicates with the namenode to determine the location of data.

    Locates tasktracker nodes with available slots or near the data and submits work to the chosen tasktracker.

    Relaunches the task on a different node in case of task failure.

    Periodically checks heartbeats from the tasktrackers.

    The jobtracker updates the job status after work completion.


  • 14

    TASKTRACKER

    A node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from the jobtracker.

    Initiates a separate JVM process to handle each task.

    Notifies process completion status to the jobtracker.

    Sends heartbeat messages to the jobtracker to notify alive status.

    Follows a master/slave architecture where the jobtracker is the master and the tasktracker the slave.

    There will be multiple tasktrackers within an HDFS cluster.


  • 15

    BASIC FILE SYSTEM OPERATIONS

    Interaction with HDFS is done through a monolithic script called bin/hadoop.

    dfs and dfsadmin are two modules of HDFS that we can interact with through this script.


  • 16

    BASIC FILE SYSTEM OPERATIONS (Contd...)

    Command format:

    user@machine:hadoop$ bin/hadoop moduleName -cmd args...

    moduleName indicates which subset of Hadoop functionality to use.

    -cmd is the name of a specific command within this module to execute.

    args... represent the command's arguments.


  • 17

    BASIC FILE SYSTEM OPERATIONS (Contd...)

    Some of the basic commands are given below. The following conventions are used for parameters.

    italics denote variables to be filled out by the user.

    path means any file or directory name.

    src and dest are path names in a directed operation.

    localSrc and localDest are paths on the local file system.


  • 18

    BASIC FILE SYSTEM OPERATIONS (Contd...)

    -ls path Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.

    -du path Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.

    -mv src dest Moves the file or directory indicated by src to dest, within HDFS.

    -cp src dest Copies the file or directory identified by src to dest, within HDFS.


  • 19

    BASIC FILE SYSTEM OPERATIONS (Contd...)

    -rm path Removes the file or empty directory identified by path.

    -put localSrc dest Copies the file or directory from the local file system identified by localSrc to dest within the DFS.

    -copyFromLocal localSrc dest Identical to -put.

    -moveFromLocal localSrc dest Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.


  • 20

    BASIC FILE SYSTEM OPERATIONS (Contd...)

    -get src localDest Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.


    -cat filename Displays the contents of filename.

    -copyToLocal src localDest Identical to -get.


  • 21

    BASIC FILE SYSTEM OPERATIONS (Contd...)

    -moveToLocal src localDest Works like -get, but deletes the HDFS copy on success.

    -mkdir path Creates a directory named path in HDFS.

    -help cmd Displays usage information for the command cmd.


  • 22

    HDFS PERMISSIONS AND SECURITY

    Designed to prevent accidental corruption of data.

    Not a strong security model that guarantees denial of access to unauthorized parties.

    Each file or directory has three permissions: read, write and execute.

    Identity is not formally authenticated by HDFS; it is taken from an extrinsic source, such as the client's username on the local system.

    The username used to start the Hadoop process is considered the superuser for HDFS.

    Permissions are enabled on HDFS by default.
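    The read/write/execute check described above can be sketched as a classic Unix-style permission test: pick the owner, group, or other bits of the mode and test the requested access. The usernames, group names, and mode values below are illustrative assumptions.

```python
# Sketch of an rwx permission check in the style HDFS inherits from
# Unix: a mode like 0o640 packs three octal digits for owner, group,
# and other; the relevant digit is selected by who is asking.
def can_access(mode: int, owner: str, group: str,
               user: str, user_groups: set, perm: str) -> bool:
    """mode is an octal triple like 0o640; perm is 'r', 'w' or 'x'."""
    bit = {"r": 4, "w": 2, "x": 1}[perm]
    if user == owner:
        bits = (mode >> 6) & 7   # owner digit
    elif group in user_groups:
        bits = (mode >> 3) & 7   # group digit
    else:
        bits = mode & 7          # other digit
    return bool(bits & bit)

# Owner may write a 0o640 file; a stranger may not even read it.
print(can_access(0o640, "alice", "staff", "alice", {"staff"}, "w"))  # True
print(can_access(0o640, "alice", "staff", "mallory", set(), "r"))    # False
```

    Since HDFS trusts the client-supplied identity, this check guards against accidents rather than determined attackers, matching the slide's caveat.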


  • 23

    DISADVANTAGES

    Provides streaming read performance at the expense of random seek times to arbitrary positions in files.

    Does not support updating existing files, though future versions are expected to support appending.

    Does not provide a mechanism for local caching of data.

    Individual machines are expected to fail on a frequent basis.

    Namenode is a single point of failure for an HDFS cluster.


  • 24

    REFERENCES

    Hadoop Wiki.

    Yahoo Hadoop Tutorials.

    Introduction to HDFS, Developer Works, IBM.

    Hadoop In Action, Chuck Lam.


  • 25

    THANK YOU
