Hadoop Distributed File System

DFS: Distributed File System

Upload: milad-sobhkhiz

Post on 27-Jan-2015


TRANSCRIPT

Page 1: Hadoop Distributed File System

DFS: Distributed File System

Page 2: Hadoop Distributed File System

Share Files Easily in Public Folder

Page 3: Hadoop Distributed File System

What about this type of networks?

Page 4: Hadoop Distributed File System

What Is DFS In Real World?

DFS allows administrators to consolidate file shares that may exist on multiple servers so that they appear to live in a single location, letting users access them from one point on the network.
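The idea above can be sketched as a lookup table: a DFS root maps logical share names to the servers that actually host them, so a client resolves one path regardless of physical location. The paths and server names below are purely illustrative, not taken from the slides.

```python
# Sketch (illustrative names): a DFS root maps logical share paths to
# the physical servers that back them, so clients use one namespace.

DFS_ROOT = {
    r"\\corp\public\reports":  [r"\\chicago-srv1\reports"],
    r"\\corp\public\installs": [r"\\denver-srv2\installs",
                                r"\\chicago-srv1\installs"],  # replicated share
}

def resolve(logical_path):
    """Return the list of physical targets backing a logical DFS path."""
    return DFS_ROOT.get(logical_path, [])

print(resolve(r"\\corp\public\installs"))
```

A replicated share (two physical targets for one logical path) is also what makes the fault-tolerance benefit on the next slides possible.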

Page 5: Hadoop Distributed File System

Example:

Page 6: Hadoop Distributed File System

Benefits of DFS

• Resource management – users access all resources through a single point

• Accessibility – users do not need to know the physical location of a shared folder; they can navigate to it through Explorer and the domain tree

• Fault tolerance – shares can be replicated, so if the server in Chicago goes down, resources are still available to users

• Workload management – DFS allows administrators to distribute shared folders and workloads across several servers for more efficient use of network and server resources

Page 7: Hadoop Distributed File System

Hadoop

Page 8: Hadoop Distributed File System

Assumptions and Goals (1)

• An HDFS instance consists of thousands of servers

• With that many components, some part of HDFS is always non-functional: hardware failure is the norm rather than the exception

• Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS

Page 9: Hadoop Distributed File System

Assumptions and Goals (2)

• Applications that run on HDFS need streaming access to their data sets

• HDFS is designed for batch processing rather than interactive use by users

• HDFS works with large data sets, typically gigabytes to terabytes in size

Page 10: Hadoop Distributed File System

Assumptions and Goals (3)

• Moving computation is cheaper than moving data

• Portability across heterogeneous hardware and software platforms
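A back-of-the-envelope calculation shows why moving computation is cheaper: shipping a small program to where a large data set lives transfers far fewer bytes than pulling the data across the network. All the numbers below are illustrative assumptions, not figures from the slides.

```python
# Illustrative numbers: compare transferring a data set vs a program.

DATASET_BYTES = 1 * 10**12    # a 1 TB data set
PROGRAM_BYTES = 50 * 10**6    # a 50 MB job (code + libraries)
NETWORK_BPS   = 125 * 10**6   # ~1 Gbit/s link = 125 MB/s

move_data_s    = DATASET_BYTES / NETWORK_BPS   # time to pull the data
move_program_s = PROGRAM_BYTES / NETWORK_BPS   # time to ship the job

print(f"move data:    {move_data_s:.0f} s (~{move_data_s / 3600:.1f} h)")
print(f"move program: {move_program_s:.1f} s")
```

Under these assumptions, moving the data takes over two hours while moving the job takes under a second, which is why HDFS schedules computation close to the blocks it reads.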

Page 11: Hadoop Distributed File System

NameNode and DataNodes (1)

• Master/slave architecture

• An HDFS cluster consists of:

- a single NameNode: a master server that manages the file system namespace and regulates access to files by clients

- a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on

Page 12: Hadoop Distributed File System

NameNode and DataNodes (2)

• Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes

• The NameNode executes file system namespace operations like opening, closing, and renaming files and directories
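The division of labour above can be sketched in a few lines: the NameNode holds only metadata (paths mapped to block lists) and performs the namespace operations, never touching block contents. This is a minimal illustration, not the real Hadoop implementation.

```python
# Minimal sketch (not real Hadoop code) of the NameNode's metadata role.

class NameNode:
    def __init__(self):
        self.namespace = {}            # file path -> list of block IDs

    def create(self, path):
        self.namespace[path] = []      # new empty file in the namespace

    def add_block(self, path, block_id):
        self.namespace[path].append(block_id)

    def rename(self, src, dst):
        self.namespace[dst] = self.namespace.pop(src)

    def delete(self, path):
        return self.namespace.pop(path, None)

nn = NameNode()
nn.create("/user/data/log.txt")
nn.add_block("/user/data/log.txt", "blk_001")
nn.rename("/user/data/log.txt", "/user/data/log.old")
print(nn.namespace)   # {'/user/data/log.old': ['blk_001']}
```

Note that a rename touches only this dictionary; the block `blk_001` stays exactly where it is on the DataNodes.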

Page 13: Hadoop Distributed File System

NameNode and DataNodes (3)

• The DataNodes are responsible for serving read and write requests from the file system’s clients

• The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode

Page 14: Hadoop Distributed File System

NameNode and DataNodes (4)

Page 15: Hadoop Distributed File System

NameNode and DataNodes (5)

• HDFS runs on a GNU/Linux operating system (OS)

• HDFS is built using the Java language

Page 16: Hadoop Distributed File System

File System NameSpace (1)

• HDFS supports a traditional hierarchical file organization

• HDFS does not yet implement user access permissions

• HDFS does not support hard links or soft links

• NameNode maintains the file system namespace

Page 17: Hadoop Distributed File System

File System NameSpace (2)

• An application can specify the number of replicas of a file that should be maintained by HDFS

• The number of copies of a file is called the replication factor of that file

Page 18: Hadoop Distributed File System

Data Replication (1)

• HDFS reliably stores very large files across machines in a large cluster

• It stores each file as a sequence of blocks

• All blocks in a file except the last block are the same size

• The block size and replication factor are configurable per file
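The block model above is easy to demonstrate: a file is cut into fixed-size pieces and only the final piece may be shorter. The 16-byte block size below is purely for illustration; real HDFS block sizes are configurable and measured in tens of megabytes.

```python
# Sketch of the block model: fixed-size blocks, short last block.
# block_size=16 is a toy value chosen only to keep the example small.

def split_into_blocks(data: bytes, block_size: int):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"x" * 40, block_size=16)
print([len(b) for b in blocks])   # [16, 16, 8] -- only the last block is short
```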

Page 19: Hadoop Distributed File System

Data Replication (2)

• The NameNode makes all decisions regarding replication of blocks

• It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster

Page 20: Hadoop Distributed File System

Data Replication (3)

• Receipt of a Heartbeat implies that the DataNode is functioning properly.

• A Blockreport contains a list of all blocks on a DataNode
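The Heartbeat mechanism can be sketched as a simple timeout check: the NameNode records when each DataNode last reported in, and any node silent for too long is treated as failed, so its blocks can be re-replicated elsewhere. The timeout and timestamps below are hypothetical, not HDFS defaults.

```python
# Toy heartbeat check (hypothetical timings, not HDFS defaults).

HEARTBEAT_TIMEOUT = 30            # seconds of silence before a node is "dead"

last_heartbeat = {"dn1": 100, "dn2": 118, "dn3": 60}   # seconds since start

def dead_datanodes(now):
    """Return DataNodes whose last Heartbeat is older than the timeout."""
    return [dn for dn, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

print(dead_datanodes(now=120))    # ['dn3'] -- silent for 60 s
```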

Page 21: Hadoop Distributed File System

Data Replication (4)

Page 22: Hadoop Distributed File System

File System Metadata (1)

[Diagram: the NameNode's local file system holds two files – FsImage and EditLog]

Page 23: Hadoop Distributed File System

File System Metadata (2)

• EditLog – records every change that occurs to the file system

• FsImage – stores the block mapping and file system properties

Page 24: Hadoop Distributed File System

File System Metadata (3)

• The NameNode keeps an image of the entire file system namespace and file Blockmap in memory

• This key metadata is compact: 4 GB of RAM is enough for a huge number of files

• Checkpoint:

- when the NameNode starts up, it reads the FsImage and EditLog

- it applies all the transactions from the EditLog to the in-memory representation of the FsImage

- it flushes this new version out to disk as a new FsImage

- a checkpoint only occurs when the NameNode starts up
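The checkpoint steps can be sketched as a replay loop: start from the last on-disk image, apply each logged transaction in order, and the merged result becomes the new FsImage. The namespace contents and operation names below are illustrative.

```python
# Sketch of the startup checkpoint: FsImage + replayed EditLog -> new FsImage.
# Paths and operations are illustrative placeholders.

fsimage = {"/a.txt": ["blk_1"]}                          # last on-disk image
editlog = [("create", "/b.txt"), ("delete", "/a.txt")]   # changes since then

def checkpoint(image, log):
    ns = dict(image)                  # in-memory copy of the FsImage
    for op, path in log:              # replay the EditLog in order
        if op == "create":
            ns[path] = []
        elif op == "delete":
            ns.pop(path, None)
    return ns                         # written to disk as the new FsImage

print(checkpoint(fsimage, editlog))   # {'/b.txt': []}
```

After the flush, the EditLog can be truncated, since its transactions are now folded into the image.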

Page 25: Hadoop Distributed File System

• Blockreport – when a DataNode starts up, it scans its local file system, generates a list of all HDFS data blocks that correspond to its local files, and sends this report to the NameNode

Page 26: Hadoop Distributed File System

Robustness

• Cluster Rebalancing

• Data Integrity (checksum)

• Metadata Disk Failure

• Snapshots
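The checksum item above works roughly like this: a checksum is computed when a block is written and verified on every read, and a mismatch marks that replica as corrupt so a good copy is fetched instead. This sketch uses CRC32 from Python's standard library as a stand-in for whatever checksum an implementation chooses.

```python
# Sketch of the data-integrity check: checksum on write, verify on read.

import zlib

def write_block(data: bytes):
    return data, zlib.crc32(data)          # store the block with its checksum

def read_block(data: bytes, stored_crc: int) -> bool:
    return zlib.crc32(data) == stored_crc  # True iff the block is intact

block, crc = write_block(b"hdfs block payload")
print(read_block(block, crc))                  # True  -> data is intact
print(read_block(b"!!" + block, crc))          # False -> fetch another replica
```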
