Hadoop at a glance

HDFS at a glance

Students: An Du – Tan Tran – Toan Do – Vinh Nguyen

Instructor: Professor Lothar Piepmayer

Agenda

1. Design of HDFS
2.1 HDFS Concepts - Blocks
2.2 HDFS Concepts - Namenode and datanodes
3.1 Dataflow - Anatomy of a file read
3.2 Dataflow - Anatomy of a file write
3.3 Dataflow - Coherency model
4. Parallel copying
5. Demo - Command line

The Design of HDFS

Very large distributed file system
- Up to 10K nodes, 1 billion files, 100PB

Streaming data access
- Write once, read many times

Commodity hardware
- Files are replicated to handle hardware failure
- Detect failures and recover from them

Poor fit for
- Low-latency data access
- Lots of small files
- Multiple writers, arbitrary file modifications

HDFS Blocks

Normal filesystem blocks are a few kilobytes
HDFS has a large block size
- Default 64MB, typical 128MB

Unlike in a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of storage
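The block arithmetic can be sketched in a few lines of Python. This is only an illustration of the rule stated on this slide, not HDFS code: the function name split_into_blocks and the example file sizes are hypothetical, and the 64MB default is the figure quoted above.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the length in bytes of each block used to store a file.

    Illustrative only: the last block holds just the remaining bytes,
    so a small file does not consume a whole block of storage.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    lengths = [block_size] * full_blocks
    if remainder:
        lengths.append(remainder)
    return lengths


# A 1MB file uses one block that is only 1MB long.
print([n // MB for n in split_into_blocks(1 * MB)])
# A 200MB file uses three full 64MB blocks plus one 8MB block.
print([n // MB for n in split_into_blocks(200 * MB)])
```

With a typical 128MB block size, the same 200MB file would need only two blocks.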

HDFS Blocks

A file is stored as blocks on various nodes in the Hadoop cluster.

HDFS creates several replicas of each data block

Each and every data block is replicated to multiple nodes across the cluster.
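A toy model shows why replication protects against a single node failure. It assumes a replication factor of 3 and a simple round-robin placement; the real HDFS placement policy is more elaborate, and the names place_replicas, available_blocks and node1..node4 are made up for this sketch.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified stand-in for the real placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


def available_blocks(placement, failed_nodes):
    """Return the ids of blocks that still have at least one live replica."""
    failed = set(failed_nodes)
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(4, nodes)
print(placement[0])                              # ['node1', 'node2', 'node3']
print(available_blocks(placement, ["node2"]))    # [0, 1, 2, 3]
```

In this model every block survives the loss of any one node, and even of any two nodes, because each block has three replicas on distinct nodes.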

Source: Dhruba Borthakur, Design and Evolution of the Apache Hadoop File System (HDFS)

HDFS Blocks

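Why are blocks in HDFS so large?

Minimize the cost of seeks => with large enough blocks, seek time is small compared to transfer time, so reading a large file runs at close to the disk transfer rate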
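The seek argument can be made concrete with a small calculation. The numbers are assumptions chosen for illustration, not measurements from this deck: a 10 ms seek time, a 100 MB/s disk transfer rate, and a target of keeping seek time at 1% of transfer time. The function name is hypothetical.

```python
MB = 1_000_000  # decimal megabyte, used only for this estimate


def block_size_for_seek_overhead(seek_time_s, transfer_rate_bytes_per_s, overhead):
    """Block size at which one seek costs `overhead` (a fraction) of the transfer time."""
    if not 0 < overhead <= 1:
        raise ValueError("overhead must be a fraction in (0, 1]")
    transfer_time_s = seek_time_s / overhead
    return transfer_time_s * transfer_rate_bytes_per_s


# Assumed disk: 10 ms seek, 100 MB/s transfer, seek limited to 1% of transfer time.
size = block_size_for_seek_overhead(0.010, 100 * MB, 0.01)
print(size / MB)  # about 100 (MB)
```

Under these assumptions the block size comes out at roughly 100MB, which is the same order of magnitude as the 64MB and 128MB values used in HDFS.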

Benefits of the block abstraction
- A file can be larger than any single disk in the network
- Simplifies the storage subsystem
- Fits well with replication for providing fault tolerance and availability

Namenode & Datanodes

Namenode (master)
- Manages the filesystem namespace
- Maintains the filesystem tree and the metadata for all the files and directories in the tree

Datanodes (slaves)
- Store data in the local file system
- Periodically report back to the namenode with lists of all the blocks they hold

Clients communicate with both namenode and datanodes.
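The division of labour between namenode and datanodes can be sketched as a toy in-memory model. This is not the Hadoop API: the class ToyNamenode, its methods, and the block and datanode names are all invented for illustration. The namenode holds only metadata (which blocks make up a file) and learns where blocks live from the datanodes' reports.

```python
class ToyNamenode:
    """Toy metadata store: keeps the namespace and block locations, never file data."""

    def __init__(self):
        self.namespace = {}        # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of datanode names

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        # A report lists ALL blocks the datanode holds, so forget what we
        # previously believed about this datanode before recording it.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return (block id, sorted datanode names) for each block of the file."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.namespace[path]
        ]


namenode = ToyNamenode()
namenode.add_file("/logs/a.txt", ["blk_1", "blk_2"])
namenode.receive_block_report("dn1", ["blk_1", "blk_2"])
namenode.receive_block_report("dn2", ["blk_1"])
print(namenode.locate("/logs/a.txt"))
# [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn1'])]
```

If dn1 later reports only blk_2, the model drops it as a holder of blk_1, which is how a periodic report keeps the location map current.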

Namenode & Datanodes

Anatomy of a File Read

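Benefits:
- Avoids a bottleneck at the namenode, which only serves block locations while the data comes directly from the datanodes
- Supports many concurrent clients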
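The read path can be sketched as follows. Everything here is a hypothetical stand-in rather than the real client API: the dictionaries play the role of namenode metadata and datanode storage, and the sketch assumes that the first datanode listed for a block is the closest one to the client.

```python
# Namenode side: metadata only (file -> blocks, block -> datanodes holding it).
FILE_BLOCKS = {"/demo.txt": ["blk_1", "blk_2", "blk_3"]}
BLOCK_LOCATIONS = {
    "blk_1": ["dn1", "dn2"],
    "blk_2": ["dn2", "dn3"],
    "blk_3": ["dn3", "dn1"],
}

# Datanode side: the actual bytes of each block.
DATANODE_STORAGE = {
    "dn1": {"blk_1": b"Hello, ", "blk_3": b"!"},
    "dn2": {"blk_1": b"Hello, ", "blk_2": b"HDFS"},
    "dn3": {"blk_2": b"HDFS", "blk_3": b"!"},
}


def get_block_locations(path):
    """What the client asks the namenode for: locations, never data."""
    return [(block_id, BLOCK_LOCATIONS[block_id]) for block_id in FILE_BLOCKS[path]]


def read_file(path):
    """Fetch each block directly from a datanode and concatenate the results."""
    data = bytearray()
    served_by = []
    for block_id, datanodes in get_block_locations(path):
        datanode = datanodes[0]  # assumption: list is ordered closest-first
        data += DATANODE_STORAGE[datanode][block_id]
        served_by.append(datanode)
    return bytes(data), served_by


content, served_by = read_file("/demo.txt")
print(content)    # b'Hello, HDFS!'
print(served_by)  # ['dn1', 'dn2', 'dn3']
```

The data traffic is spread over three datanodes, while the namenode answers a single metadata request.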

Writing in HDFS

Figure labels: Namenode, Datanode, Block

Writing in HDFS

Exceptions: a datanode fails during a write
- The pipeline is closed, and the failed node's block and address are removed
- The namenode arranges for a new datanode to hold a replica
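The failure handling above can be sketched as a toy simulation. It is a simplification under stated assumptions, not the real protocol: packets are plain strings, the failed node's partial block is discarded immediately, and the functions write_block and re_replicate and the node names are hypothetical.

```python
def write_block(packets, pipeline, failed_node=None, fail_at=0):
    """Send packets down a datanode pipeline; optionally one node fails.

    `failed_node` fails just before packet number `fail_at` is written.
    Returns a dict mapping each surviving datanode to the packets it stored.
    """
    pipeline = list(pipeline)
    stored = {node: [] for node in pipeline}
    for index, packet in enumerate(packets):
        if failed_node in pipeline and index == fail_at:
            pipeline.remove(failed_node)  # drop the node's address from the pipeline
            del stored[failed_node]       # its partial block is discarded
        for node in pipeline:
            stored[node].append(packet)
    return stored


def re_replicate(stored, spare_nodes, target_replication=3):
    """Namenode's job in this model: copy the block to spares until fully replicated."""
    if not stored:
        raise ValueError("no surviving replica to copy from")
    result = {node: list(packets) for node, packets in stored.items()}
    source = next(iter(result.values()))
    for spare in spare_nodes:
        if len(result) >= target_replication:
            break
        result[spare] = list(source)
    return result


packets = ["p0", "p1", "p2", "p3"]
after_write = write_block(packets, ["dn1", "dn2", "dn3"], failed_node="dn2", fail_at=2)
print(sorted(after_write))   # ['dn1', 'dn3']
after_repair = re_replicate(after_write, ["dn4"])
print(sorted(after_repair))  # ['dn1', 'dn3', 'dn4']
```

The write completes on the two healthy datanodes, and the block returns to three replicas once a new datanode receives a copy.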

Coherency Model

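Content being written is not guaranteed to be visible to readers while the write is in progress
Use sync() to force the data written so far to become visible
Applications should call sync() at suitable points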
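The visibility rule can be illustrated with a toy stream. This is a simplified model, not the HDFS client: the class ToyOutputStream is invented for the sketch, and it ignores block boundaries by treating everything written since the last sync() as invisible.

```python
class ToyOutputStream:
    """Toy writer: data becomes visible to readers only after sync() or close()."""

    def __init__(self):
        self._visible = bytearray()   # what a new reader would see
        self._buffered = bytearray()  # written but not yet synced

    def write(self, data):
        self._buffered += data

    def sync(self):
        self._visible += self._buffered
        self._buffered.clear()

    def close(self):
        # In this model, closing implies a final sync.
        self.sync()

    def visible_to_readers(self):
        return bytes(self._visible)


out = ToyOutputStream()
out.write(b"first record\n")
print(out.visible_to_readers())  # b''
out.sync()
print(out.visible_to_readers())  # b'first record\n'
out.write(b"second record\n")
print(out.visible_to_readers())  # b'first record\n'
out.close()
print(out.visible_to_readers())  # b'first record\nsecond record\n'
```

The trade-off for an application is between calling sync() often, which costs some throughput, and calling it rarely, which leaves more data at risk if the client fails.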

Parallel copying in HDFS

Transfer data between clusters
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar

Implemented as a MapReduce job; each file is copied by a single map
Each map copies at least 256MB
Default maximum is 20 maps per node

Copying between clusters that run different HDFS versions requires an HTTP-based protocol such as webhdfs:
% hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
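The two sizing rules (at least 256MB per map, at most 20 maps per node) can be combined into a rough estimate of how many maps a copy will use. This is one reading of those rules, not distcp's actual code: the function name distcp_map_count is hypothetical, and the sketch ignores the fact that the number of maps cannot exceed the number of files.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def distcp_map_count(total_bytes, num_nodes,
                     min_bytes_per_map=256 * MB, max_maps_per_node=20):
    """Estimate the number of maps: limited by data size and by cluster size."""
    if total_bytes < 0 or num_nodes <= 0:
        raise ValueError("total_bytes must be >= 0 and num_nodes > 0")
    # Integer ceiling division: enough maps that none copies less than the minimum.
    by_size = max(1, -(-total_bytes // min_bytes_per_map))
    by_cluster = max_maps_per_node * num_nodes
    return min(by_size, by_cluster)


# 1GB on a 3-node cluster: limited by data size.
print(distcp_map_count(1 * GB, 3))        # 4
# 1000GB on a 100-node cluster: limited by the per-node maximum.
print(distcp_map_count(1000 * GB, 100))   # 2000
```

In the second case each map copies 512MB on average, since 1000GB is shared among 2000 maps.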

Cluster with 3 nodes: 4 GB RAM, 2 CPUs @ 2.0 GHz+, 100 GB HDD
Running on VMware on 3 different servers
Network: 100 Mbps
Operating system: Ubuntu 11.04
Windows: not tested

Setup Guide - Single Node

Java runtime
ssh
http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
/etc/hadoop/core-site.xml
/etc/hadoop/hdfs-site.xml

Cluster

/etc/hadoop/masters
/etc/hadoop/slaves
http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html

Command Line

Similar to *nix:
hadoop fs -ls /
hadoop fs -mkdir /test
hadoop fs -rmr /test
hadoop fs -cp /1 /2
hadoop fs -copyFromLocal /3 hdfs://localhost/

Namenode-specific:
hadoop namenode -format
start-all.sh

Command Line

Sorting: standard method to test a cluster
- TeraGen: generate dummy data
- TeraSort: sort
- TeraValidate: validate the sort result

Command line:
hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted41

Benchmark Result

2 nodes, 1GB data: 0:03:38
3 nodes, 1GB data: 0:03:13

2 nodes, 10GB data: 0:38:07
3 nodes, 10GB data: 0:31:28

The virtual machines' hard disks are the bottleneck
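The timings above can be turned into speedup figures with a short calculation. The script uses only the four measurements from this slide; the comparison value of 1.5 is an assumption, namely what perfectly linear scaling from 2 to 3 nodes would give, and the function names are hypothetical.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (number of nodes, data size in GB) -> measured TeraSort time from the slide
RESULTS = {
    (2, 1): "0:03:38",
    (3, 1): "0:03:13",
    (2, 10): "0:38:07",
    (3, 10): "0:31:28",
}


def speedup(data_gb):
    """How many times faster 3 nodes were than 2 nodes for the given data size."""
    return to_seconds(RESULTS[(2, data_gb)]) / to_seconds(RESULTS[(3, data_gb)])


IDEAL = 3 / 2  # assumption: perfectly linear scaling from 2 to 3 nodes

for data_gb in (1, 10):
    print(f"{data_gb}GB: speedup {speedup(data_gb):.2f} (ideal {IDEAL:.2f})")
# 1GB: speedup 1.13 (ideal 1.50)
# 10GB: speedup 1.21 (ideal 1.50)
```

Both speedups fall well short of 1.5, which is consistent with a bottleneck other than the number of nodes.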

Who wins…

References

Tom White, Hadoop: The Definitive Guide
