The Hadoop Distributed File System

Post on 30-Dec-2015



PaoMin Wu, University at Buffalo

The Hadoop Distributed File System

ARCHITECTURE

1. NameNode: stores the metadata of the system; keeps the entire namespace in RAM

2. DataNode: holds block replicas; stores application data

3. HDFS Client: user applications access the file system using the HDFS client
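
The division of labor among these three components can be sketched as a toy in-memory model. This is an illustration only, not the HDFS API: the class names `ToyNameNode`, `ToyDataNode`, and `ToyClient`, the tiny block size, and the round-robin replica placement are all assumptions made for the example.

```python
from itertools import cycle


class ToyDataNode:
    """Stores application data as block replicas (hypothetical stand-in)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class ToyNameNode:
    """Keeps only metadata, entirely in memory: path -> blocks -> replica locations."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = min(replication, len(datanodes))
        self.namespace = {}   # path -> list of block ids
        self.locations = {}   # block id -> list of ToyDataNode
        self._next_id = 0
        self._placement = cycle(datanodes)  # assumed round-robin placement

    def allocate_block(self, path):
        block_id = self._next_id
        self._next_id += 1
        targets = [next(self._placement) for _ in range(self.replication)]
        self.namespace.setdefault(path, []).append(block_id)
        self.locations[block_id] = targets
        return block_id, targets

    def get_block_locations(self, path):
        return [(b, self.locations[b]) for b in self.namespace[path]]


class ToyClient:
    """User applications go through the client: it asks the NameNode for
    metadata and moves the bytes to and from DataNodes directly."""

    def __init__(self, namenode, block_size=4):
        self.namenode = namenode
        self.block_size = block_size  # tiny on purpose; real blocks are far larger

    def write(self, path, data):
        for i in range(0, len(data), self.block_size):
            block_id, targets = self.namenode.allocate_block(path)
            for dn in targets:
                dn.write_block(block_id, data[i:i + self.block_size])

    def read(self, path):
        chunks = []
        for block_id, replicas in self.namenode.get_block_locations(path):
            chunks.append(replicas[0].read_block(block_id))  # any replica works
        return b"".join(chunks)
```

Note that file contents never pass through the toy NameNode; it only hands out block ids and replica locations, which is the point the slide makes about metadata versus application data.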

HDFS Client Process

ARCHITECTURE

4. Image and Journal: the namespace image is the file system metadata; a persistent record of the image is a checkpoint

5. CheckpointNode (NameNode): protects file system metadata

6. BackupNode (NameNode): capable of creating periodic checkpoints
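
The relationship between image, checkpoint, and journal can be sketched as follows. The sketch assumes, following the HDFS paper, that the journal is a log of namespace changes made since the last checkpoint and that a new checkpoint is produced by replaying the journal onto the old one. The function names and the tuple-based operation format are hypothetical and chosen only for illustration.

```python
import json


def apply_op(image, op):
    """Apply one journaled namespace change to an in-memory image (a dict)."""
    kind, path = op[0], op[1]
    if kind == "create":
        image[path] = []              # new file with no blocks yet
    elif kind == "add_block":
        image[path].append(op[2])
    elif kind == "delete":
        image.pop(path, None)
    else:
        raise ValueError(f"unknown op: {kind}")


def load_namespace(checkpoint_text, journal):
    """Rebuild the in-memory image: read the checkpoint, then replay the journal."""
    image = json.loads(checkpoint_text)
    for op in journal:
        apply_op(image, op)
    return image


def create_checkpoint(checkpoint_text, journal):
    """Merge the old checkpoint and the journal into a new checkpoint
    and an empty journal."""
    image = load_namespace(checkpoint_text, journal)
    return json.dumps(image, sort_keys=True), []
```

Creating checkpoints periodically keeps the journal short, so a restart has less to replay.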

FILE I/O OPERATIONS AND REPLICA MANAGEMENT

Sort Benchmark

Future Work

Problem: NameNode contains all important information

Solution: Allow multiple namespaces (and NameNodes) to share the physical storage within a cluster
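
One way to picture the proposed solution is sketched below. It assumes that each NameNode tags its blocks with its own pool identifier so that several independent namespaces can place replicas on the same DataNodes without id collisions. The classes `SharedDataNode` and `NamespaceNode` are hypothetical and are not part of any Hadoop API.

```python
class SharedDataNode:
    """One physical store shared by all namespaces; replicas are keyed by
    (pool id, block id) so block ids from different NameNodes cannot collide."""

    def __init__(self):
        self.replicas = {}

    def write(self, pool_id, block_id, data):
        self.replicas[(pool_id, block_id)] = data

    def read(self, pool_id, block_id):
        return self.replicas[(pool_id, block_id)]


class NamespaceNode:
    """A NameNode that manages one independent namespace (hypothetical model)."""

    def __init__(self, pool_id, datanodes):
        self.pool_id = pool_id
        self.datanodes = datanodes
        self.files = {}      # path -> list of block ids
        self._next_id = 0

    def put(self, path, data):
        block_id = self._next_id
        self._next_id += 1
        for dn in self.datanodes:     # replicate to every node, for simplicity
            dn.write(self.pool_id, block_id, data)
        self.files.setdefault(path, []).append(block_id)

    def get(self, path):
        dn = self.datanodes[0]
        return b"".join(dn.read(self.pool_id, b) for b in self.files[path])
```

Two `NamespaceNode` instances built over the same list of `SharedDataNode` objects can both store a file named `/data` without interfering with each other, while the metadata load is split between them.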

PaoMin Wu, University at Buffalo

MapReduce: Simplified Data Processing on Large Clusters

Introduction

• key/value pair

• execution across a set of machines

• handling machine failures

• managing the required inter-machine communication

• runs on a large cluster

• powerful interface

• automatic parallelization

• distribution of large-scale computations

Programming Model

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs.

The Reduce function, also written by the user, accepts an intermediate key and a set of values for that key.

The intermediate values are supplied to the user's reduce function via an iterator.

Example:
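
A minimal single-process sketch of the programming model, using word counting (the example given in the MapReduce paper). The driver `run_mapreduce` is a hypothetical stand-in for the library: it runs sequentially and does none of the distribution, scheduling, or fault handling that the real system provides.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name, value: document contents. Emits (word, 1) pairs."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """key: a word, values: an iterator over its counts. Returns the total."""
    return sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical sequential driver: map, group by intermediate key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    # Values reach the reduce function through an iterator, as in the model.
    return {ikey: reduce_fn(ikey, iter(ivalues))
            for ikey, ivalues in sorted(groups.items())}
```

Calling `run_mapreduce([("d1", "the quick fox")], map_fn, reduce_fn)` returns a dictionary mapping each word to its count. The user supplies only `map_fn` and `reduce_fn`; everything else belongs to the framework.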

Execution Overview:

Backup Tasks:

Conclusions

1. Restricting the programming model is beneficial

2. Network bandwidth is a scarce resource

3. Redundant execution can help

References:

The Hadoop Distributed File System. Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler. Yahoo!, Sunnyvale, California, USA. {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com

MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. Google, Inc. jeff@google.com, sanjay@google.com
