Hadoop Distributed File System

  • 1

    Hadoop Distributed File System

  • 2

    CONTENTS HDFS Introduction

    HDFS Advantages

    HDFS Architecture

    Daemons

    Basic File System Operations

    HDFS Permissions and Security

    Disadvantages

    Only for TCS Internal Training - NextGen Solutions, Kochi

  • 3

    INTRODUCTION

    HDFS, the Hadoop Distributed File System, is a distributed file system that holds large amounts of data, from terabytes up to petabytes.

    Is a distributed, scalable, portable file system.

    Is based on the design of GFS, Google File System.

    Is a block structured file system where individual files are broken into blocks of fixed size.

    Runs in a separate namespace, isolated from the contents of your local files.

    Files are stored in a redundant manner to ensure durability against failure.

    Data is written to HDFS once and read many times.

    Updates to existing files in HDFS are not supported; an extension to Hadoop provides support for appending new data to the end of files.
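    The block-structured layout described above can be sketched as follows. The 8-byte block size and sample data are illustrative assumptions; real HDFS blocks are far larger (64 MB in classic Hadoop).

```python
# Sketch of how a block-structured file system splits a file into
# fixed-size blocks. Block size here is an illustrative assumption;
# real HDFS defaults are 64 MB or 128 MB.
def split_into_blocks(data: bytes, block_size: int):
    """Break a byte string into fixed-size blocks; the last may be short."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"abcdefghijklmnopqrst", 8)
# 20 bytes at block size 8 -> blocks of 8, 8 and 4 bytes
print([len(b) for b in blocks])  # [8, 8, 4]
```

    Each of these blocks would then be stored (and replicated) on different datanodes, which is what lets a single file exceed the capacity of any one machine.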


  • 4

    ADVANTAGES

    Can hold large data sets (terabytes to petabytes) through data distribution among multiple machines.

    Highly fault tolerant.

    High throughput through parallel computing.

    Streaming access to file system data.

    Can be built out of low-cost hardware.

    Processing logic close to the data.

    Reliability by automatically maintaining multiple copies of data.


  • 5

    ARCHITECTURE


  • 6

    COMMUNICATION PROTOCOL

    All HDFS communication protocols are layered on top of the TCP/IP protocol.

    A client establishes a connection to a configurable TCP port on the namenode machine, using the Client Protocol.

    Datanodes talk to the namenode using the DataNode Protocol.

    The namenode only responds to Remote Procedure Call (RPC) requests issued by datanodes or clients; it never initiates a connection itself.


  • 7

    RACK AWARENESS

    Racks can be considered as a set of rows, where each row consists of a group of machines or nodes.

    Large hadoop clusters are arranged in racks.

    Communication between nodes in the same rack is faster (higher bandwidth) than communication between nodes spread across different racks.

    Replicas of a block are placed on multiple racks for improved fault tolerance.

    HDFS can be made rack-aware by the use of a network topology script.

    The master node uses the network topology script to map the network topology of the cluster.

    The network topology script receives the IP addresses of machines as input and returns a list of rack names, one for each input.
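    A minimal topology script of this shape might look like the following sketch. The subnet-to-rack table is a made-up example; real deployments typically derive the rack from their own addressing scheme.

```python
#!/usr/bin/env python3
# Sketch of an HDFS-style network topology script: given node IP addresses
# as command-line arguments, print one rack name per input address.
# The subnet-to-rack mapping below is an illustrative assumption.
import sys

RACKS = {
    "10.1.1": "/rack1",
    "10.1.2": "/rack2",
}

def rack_for(ip: str) -> str:
    subnet = ip.rsplit(".", 1)[0]              # e.g. "10.1.1.17" -> "10.1.1"
    return RACKS.get(subnet, "/default-rack")  # unknown hosts fall back

if __name__ == "__main__":
    for ip in sys.argv[1:]:
        print(rack_for(ip))
```

    The master invokes such a script with a batch of addresses and reads one rack name per line, which is why the output order must match the input order.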


  • 8

    BLOCK REBALANCE

    Rebalancing distributes HDFS data uniformly (equally) across the datanodes in a cluster.

    Block replicas will be spread across the racks to ensure backups in case of rack failure.

    In case of node failure, preference will be given to replicas on the same rack so that cross-rack network I/O is reduced.

    An automatic balancer tool, included as part of Hadoop, intelligently balances blocks across the nodes.

    Perfect balancing is unlikely to be achieved.

    It is best to run the balancing script when cluster utilisation is at its minimum.
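    The balancer's core idea can be sketched as follows: a node is over- or under-utilised when its usage deviates from the cluster average by more than a threshold, and blocks are moved from the former to the latter. The node names, usage figures, and 10% threshold are illustrative assumptions.

```python
# Sketch of the balancer's classification step: compare each datanode's
# disk usage (as a fraction) against the cluster average. Nodes outside
# the threshold band are candidates for block movement.
def classify_nodes(usage: dict, threshold: float = 0.10):
    """Return (over, under) lists of node names relative to the average."""
    avg = sum(usage.values()) / len(usage)
    over = [n for n, u in usage.items() if u > avg + threshold]
    under = [n for n, u in usage.items() if u < avg - threshold]
    return over, under

# Average usage is 0.50; dn1 is over-utilised, dn3 under-utilised.
over, under = classify_nodes({"dn1": 0.90, "dn2": 0.50, "dn3": 0.10})
print(over, under)  # ['dn1'] ['dn3']
```

    The real balancer also honours replica-placement rules while moving blocks, which is one reason perfect balance is rarely reached.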


  • 9

    DAEMONS

    Daemons are resident programs that together constitute a running Hadoop installation.

    Each runs as a background service on a node in the cluster.

    They include the namenode, datanode, secondary namenode, jobtracker and tasktracker.


  • 10

    NAMENODE

    Namenode is the bookkeeper of HDFS.

    Stores all the metadata (data about data) for the filesystem.

    Keeps track of how your files are broken down into file blocks.

    Checks the overall health of the distributed file system.

    HDFS cluster consists of a single namenode, a master server that manages the file system namespace and regulates access to files by clients.

    Periodically checks heartbeat messages from the datanodes to verify that they are alive.


  • 11

    DATANODE

    Datanodes are the individual machines in an HDFS cluster that store data.

    There will be multiple datanodes within an HDFS cluster.

    Multiple blocks which constitute a single file will be stored across different datanodes.

    File blocks are replicated across multiple datanodes to avoid data loss.

    HDFS follows a master-slave architecture where there will be a single namenode (Master) and multiple datanodes (Slaves).

    Serves read and write requests, and performs block creation, deletion and replication upon instruction from the namenode.

    Periodically sends heartbeat messages to the namenode to report that it is alive.
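    The heartbeat mechanism can be sketched as follows: the namenode records the time of each datanode's last heartbeat and treats a node as dead once that time is older than a timeout. The clock values are illustrative assumptions; the 600-second default roughly mirrors Hadoop's classic ten-minute liveness window.

```python
# Sketch of heartbeat-based liveness checking on the namenode side:
# a datanode whose last heartbeat is older than `timeout` seconds
# is considered dead, and its blocks get re-replicated elsewhere.
def dead_nodes(last_heartbeat: dict, now: float, timeout: float = 600.0):
    """Return datanodes whose last heartbeat is older than `timeout` seconds."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout)

# dn1 reported 30 s ago (alive); dn2 reported 930 s ago (dead).
print(dead_nodes({"dn1": 1000.0, "dn2": 100.0}, now=1030.0))  # ['dn2']
```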


  • 12

    SECONDARY NAMENODE (SNN)

    It is an assistant daemon that helps to monitor the state of the HDFS cluster.

    There will be a single secondary name node per cluster.

    Communicates with the namenode to gather information about HDFS metadata.

    Usually runs on a different machine than the primary namenode, as its memory requirements are of the same order as the primary's.

    Its copy of the HDFS metadata can be used to reconfigure the cluster in case of primary namenode failure.


  • 13

    JOBTRACKER

    A service within Hadoop that assigns tasks to specific nodes within the cluster.

    There will be a single jobtracker per cluster, and it typically runs on a master node.

    Client applications submit jobs to the jobtracker.

    Communicates with the namenode to determine the location of data.

    Locates tasktracker nodes with available slots or near the data and submits work to the chosen tasktracker.

    Relaunches the task on a different node in case of task failure.

    Periodically checks heartbeats from the tasktrackers.

    The jobtracker updates the job status after work completion.


  • 14

    TASKTRACKER

    A node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from the jobtracker.

    Initiates a separate JVM process to handle each task.

    Notifies process completion status to the jobtracker.

    Sends heartbeat messages to the jobtracker to notify alive status.

    Follows a master/slave architecture where the jobtracker is the master and the tasktracker the slave.

    There will be multiple tasktrackers within an HDFS cluster.


  • 15

    BASIC FILE SYSTEM OPERATIONS

    Interaction with HDFS is done through a monolithic script called bin/hadoop.

    dfs and dfsadmin are two modules of HDFS that we can interact with through this script.


  • 16

    BASIC FILE SYSTEM OPERATIONS (Contd...)

    Command format:

    user@machine:hadoop$ bin/hadoop moduleName -cmd args...

    moduleName indicates which subset of Hadoop functionality to use.

    -cmd is the name of a specific command within this module to execute.

    args... represent the command's arguments.


  • 17

    BASIC FILE SYSTEM OPERATIONS (Contd...)

    Some of the basic commands are given below. The following conventions are used for parameters.

    italics denote variables to be filled out by the user.

    path means any file or directory name.

    src and dest are path names in a directed operation.

    localSrc and localDest are paths on the local file system.


  • 18

    BASIC FILE SYSTEM OPERATIONS (Contd...)

    -ls path Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.

    -du path Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.

    -mv src dest Moves the file or directory indicated by src to dest, within HDFS.

    -cp src dest Copies the file or directory identified by src to dest, within HDFS.


  • 19

    BASIC FILE SYSTEM OPERATIONS (Contd...)

    -rm path Removes the file or empty directory identified by path.

    -put localSrc dest Copies the file or directory from the local file system identified by localSrc to dest within the DFS.

    -copyFromLocal localSrc dest Identical to -put.

    -moveFromLocal localSrc dest Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.


  • 20

    BASIC FILE SYSTEM OPERATIONS (Contd...)

    -get src localDest Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.


    -cat filename Displays the contents of filename.

    -copyToLocal src localDest Identical to -get.


  • 21

    BASIC FILE SYSTEM OPERATIONS (Contd...)

    -moveToLocal src localDest Works like -get, but deletes the HDFS copy on success.

    -mkdir path Creates a directory named path in HDFS.

    -help cmd Displays usage information for the command cmd.


  • 22

    HDFS PERMISSIONS AND SECURITY

    Designed to prevent accidental corruption of data.

    Not a strong security model that guarantees denial of access to unauthorized parties.

    Each file or directory has three permissions: read, write and execute.

    Identity is not formally authenticated by HDFS; it is taken from an extrinsic source, such as the client's username on the local system.

    The username used to start the Hadoop process is considered the superuser for HDFS.

    Permissions are enabled on HDFS by default.
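    The read/write/execute check described above can be sketched as a classic Unix-style permission test: pick the owner, group, or other bits of the mode and test the requested access. The usernames, group names, and mode values below are illustrative assumptions.

```python
# Sketch of an rwx permission check in the style HDFS inherits from
# Unix: a mode like 0o640 packs three octal digits for owner, group,
# and other; the relevant digit is selected by who is asking.
def can_access(mode: int, owner: str, group: str,
               user: str, user_groups: set, perm: str) -> bool:
    """mode is an octal triple like 0o640; perm is 'r', 'w' or 'x'."""
    bit = {"r": 4, "w": 2, "x": 1}[perm]
    if user == owner:
        bits = (mode >> 6) & 7   # owner digit
    elif group in user_groups:
        bits = (mode >> 3) & 7   # group digit
    else:
        bits = mode & 7          # other digit
    return bool(bits & bit)

# Owner may write a 0o640 file; a stranger may not even read it.
print(can_access(0o640, "alice", "staff", "alice", {"staff"}, "w"))  # True
print(can_access(0o640, "alice", "staff", "mallory", set(), "r"))    # False
```

    Since HDFS trusts the client-supplied identity, this check guards against accidents rather than determined attackers, matching the slide's caveat.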


  • 23

    DISADVANTAGES

    Provides streaming read performance at the expense of random seek times to arbitrary positions in files.

    Does not support updating existing files, though future versions are expected to support appending.

    Does not provide a mechanism for local caching of data.

    Individual machines are expected to fail on a frequent basis.

    Namenode is a single point of failure for an HDFS cluster.


  • 24

    REFERENCES

    Hadoop Wiki.

    Yahoo Hadoop Tutorials.

    Introduction to HDFS, Developer Works, IBM.

    Hadoop In Action, Chuck Lam.


  • 25

    THANK YOU
