Post on 30-Dec-2015
PaoMin Wu, University at Buffalo
The Hadoop Distributed File System
ARCHITECTURE
1. NameNode: stores the metadata of the system; keeps the entire namespace in RAM
2. DataNode: stores application data as block replicas
3. HDFS Client: user applications access the file system using the HDFS client
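The division of labour among the three components can be illustrated with a toy, single-process model. This is only a sketch under stated assumptions: the class and method names (`allocate_block`, `block_locations`, and so on), the tiny block size, and the round-robin replica placement are hypothetical choices for illustration, not the real HDFS API or placement policy.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Holds application data as block replicas (block id -> bytes)."""
    name: str
    blocks: dict = field(default_factory=dict)


@dataclass
class NameNode:
    """Holds only metadata, entirely in memory:
    path -> list of (block id, names of DataNodes holding a replica)."""
    namespace: dict = field(default_factory=dict)
    next_block_id: int = 0

    def create(self, path):
        self.namespace[path] = []

    def allocate_block(self, path, datanode_names, replication):
        block_id = self.next_block_id
        self.next_block_id += 1
        # Toy placement (an assumption): pick targets round-robin.
        targets = [datanode_names[(block_id + i) % len(datanode_names)]
                   for i in range(replication)]
        self.namespace[path].append((block_id, targets))
        return block_id, targets

    def block_locations(self, path):
        return list(self.namespace[path])


class HDFSClient:
    """What a user application talks to: asks the NameNode for metadata,
    then moves the data itself to or from DataNodes."""

    def __init__(self, namenode, datanodes, block_size=4, replication=2):
        self.namenode = namenode
        self.datanodes = {d.name: d for d in datanodes}
        self.block_size = block_size
        self.replication = replication

    def write(self, path, data):
        self.namenode.create(path)
        names = sorted(self.datanodes)
        for offset in range(0, len(data), self.block_size):
            block_id, targets = self.namenode.allocate_block(
                path, names, self.replication)
            for name in targets:
                chunk = data[offset:offset + self.block_size]
                self.datanodes[name].blocks[block_id] = chunk

    def read(self, path):
        chunks = []
        for block_id, locations in self.namenode.block_locations(path):
            # Use the first listed DataNode that still holds the block.
            holders = [n for n in locations
                       if block_id in self.datanodes[n].blocks]
            if not holders:
                raise IOError(f"no live replica for block {block_id}")
            chunks.append(self.datanodes[holders[0]].blocks[block_id])
        return b"".join(chunks)
```

Because every block is written to more than one DataNode in this sketch, a read still succeeds after one DataNode loses its blocks, while the NameNode itself never stores any file contents.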
HDFS Client Process
ARCHITECTURE
4. Image and Journal: namespace image = file system metadata; checkpoint = persistent record of the image; journal = write-ahead log of changes to the image
5. CheckpointNode (alternative NameNode role): protects file system metadata
6. BackupNode (alternative NameNode role): capable of creating periodic checkpoints
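The relationship between image, checkpoint, and journal can be sketched as follows: a new checkpoint is produced by replaying the journal on top of the last checkpoint, after which the journal can start empty. The record format (`("create", path)` and `("delete", path)` tuples), the JSON serialization, and the function names are assumptions made for this illustration; they are not the on-disk formats HDFS uses.

```python
import json


def apply_journal(image, journal):
    """Replay journal records on top of a namespace image and return
    the resulting image. The input image is left unmodified."""
    new_image = dict(image)
    for op, path in journal:
        if op == "create":
            new_image[path] = {"blocks": []}
        elif op == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError(f"unknown journal operation: {op!r}")
    return new_image


def create_checkpoint(checkpoint_text, journal):
    """Combine an existing checkpoint (serialized image) with the journal.

    Returns the new checkpoint text and a fresh, empty journal.
    """
    image = json.loads(checkpoint_text)
    new_image = apply_journal(image, journal)
    return json.dumps(new_image, sort_keys=True), []
```

For example, starting from a checkpoint that contains only `/a` and a journal that creates `/b` and then deletes `/a`, the new checkpoint contains only `/b`. Replaying a short journal on a recent checkpoint is what keeps recovery of the metadata cheap.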
FILE I/O OPERATIONS AND REPLICA MANAGEMENT
Sort Benchmark
Future Work
Problem: The NameNode keeps the entire namespace in memory, so a single NameNode limits how far a cluster can scale
Solution: Allow multiple namespaces (and NameNodes) to share the physical storage within a cluster
PaoMin Wu, University at Buffalo
MapReduce: Simplified Data Processing on Large Clusters
Introduction
• key/value pair
• execution across a set of machines
• handling machine failures
• managing the required inter-machine communication
• runs on a large cluster
• powerful interface
• automatic parallelization
• distribution of large-scale computations
Programming Model
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs.
The Reduce function, also written by the user, accepts an intermediate key and a set of values for that key, and merges those values into a possibly smaller set of values.
The intermediate values are supplied to the user's reduce function via an iterator.
Example:
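A minimal sketch of the programming model, using word counting (the example given in the MapReduce paper). The names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the single-process driver only imitates the grouping step that a real cluster performs across many machines.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name, value: document contents.
    Emits an intermediate (word, 1) pair for every word."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """key: a word, values: an iterator over the counts emitted for it.
    Emits the total count for that word."""
    yield key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy sequential driver: map every input pair, group the
    intermediate values by key, then reduce each group."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)

    results = {}
    for key in sorted(intermediate):
        # Values reach the reduce function through an iterator.
        for out_key, out_value in reduce_fn(key, iter(intermediate[key])):
            results[out_key] = out_value
    return results
```

Calling `run_mapreduce([("d1", "a b a"), ("d2", "b c")], map_fn, reduce_fn)` groups the intermediate pairs by word before reducing, so each call to `reduce_fn` sees all counts for exactly one word.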
Execution Overview:
Backup Tasks:
Conclusions
1. Restricting the programming model is beneficial
2. Network bandwidth is a scarce resource
3. Redundant execution can help
References:
The Hadoop Distributed File System. Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler. Yahoo!, Sunnyvale, California, USA. {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com
MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. Google, Inc. jeff@google.com, sanjay@google.com