Big Data: HDFS (2nd presentation)
Presentation on Big Data: Hadoop Distributed File System (HDFS)
Presented by –
TAKRIM UL ISLAM LASKAR (120103006)
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM: Blue Cloud?
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
HADOOP DISTRIBUTED FILE SYSTEM(HDFS)
• Highly fault tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware
Commodity Hardware
Typically a two-level architecture:
– Nodes are commodity PCs
– Uplink from each rack is 3–4 gigabit
– Rack-internal is 1 gigabit
Why DFS?
Reading 1 TB of data:
• 1 machine: 4 I/O channels, each channel 100 MB/s → ~45 minutes
• 10 machines: 4 I/O channels each, 100 MB/s per channel → ~4.5 minutes
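The slide's numbers can be checked with back-of-the-envelope arithmetic; this sketch assumes 1 TB = 1,000,000 MB and that the data is spread evenly so all channels read in parallel:

```python
TB_IN_MB = 1_000_000   # assumed decimal terabyte
CHANNELS = 4           # I/O channels per machine, per the slide
MB_PER_S = 100         # throughput per channel, per the slide

def minutes_to_read(machines):
    """Minutes to scan 1 TB spread evenly across `machines`."""
    aggregate = machines * CHANNELS * MB_PER_S  # total MB/s
    return TB_IN_MB / aggregate / 60

print(round(minutes_to_read(1)))      # ~42 min (the slide rounds to 45)
print(round(minutes_to_read(10), 1))  # ~4.2 min (the slide says 4.5)
```

The exact figures differ slightly from the slide's rounded 45 / 4.5, but the tenfold speedup from parallel reads is the point.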
Goals of HDFS
• Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for Batch Processing
– Data locations exposed so that computations can move to where data resides
– Provides very high aggregate bandwidth
HDFS Architecture
Functions of a NameNode
• Manages File System Namespace
– Maps a file name to a set of blocks
– Maps a block to the DataNodes where it resides
• Cluster Configuration Management
• Replication Engine for Blocks
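The two namespace mappings above can be pictured as a pair of lookup tables. A toy sketch (plain Python, not Hadoop code; file, block, and node names are made up):

```python
# file name -> ordered list of block IDs
namespace = {"/foodir/myfile.txt": ["blk_1", "blk_2"]}

# block ID -> DataNodes currently holding a replica
block_locations = {
    "blk_1": ["datanode-a", "datanode-b", "datanode-c"],
    "blk_2": ["datanode-b", "datanode-c", "datanode-d"],
}

def locate(path):
    """Resolve a file to the DataNodes serving each of its blocks."""
    return [(blk, block_locations[blk]) for blk in namespace[path]]

print(locate("/foodir/myfile.txt"))
```

A client only ever asks the NameNode for these mappings; the actual block data flows directly between the client and the DataNodes.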
NameNode Metadata
• Meta-data in Memory
– The entire metadata is in main memory (RAM)
• Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.
DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores meta-data of a block
– Serves data and meta-data to Clients
• Block Report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
Block Placement
• Current Strategy
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack
– Additional replicas are randomly placed
• Clients read from nearest replica.
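The placement strategy above can be sketched as a small function. This is an illustration of the policy as stated on the slide, not Hadoop's actual placement code; rack and node names are made up:

```python
import random

def place_replicas(local_node, racks, n_replicas=3):
    """Pick DataNodes for a block: local node first, then two nodes
    on one remote rack, then random extras for higher replication."""
    replicas = [local_node]
    # find which rack the writer's node lives on
    local_rack = next(r for r, nodes in racks.items() if local_node in nodes)
    # 2nd and 3rd replicas go to the same remote rack
    remote_rack = random.choice([r for r in racks if r != local_rack])
    remote_nodes = [n for n in racks[remote_rack] if n not in replicas]
    replicas += random.sample(remote_nodes, 2)
    # any additional replicas are placed randomly
    spare = [n for ns in racks.values() for n in ns if n not in replicas]
    replicas += random.sample(spare, n_replicas - 3)
    return replicas

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas("n1", racks))
```

Placing the second and third replicas on one remote rack survives a whole-rack failure while keeping cross-rack write traffic to a single transfer.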
Heartbeats
• DataNodes send heartbeat to the NameNode
– Once every 3 seconds
• NameNode uses heartbeats to detect DataNode failure
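Failure detection from heartbeats is a simple timeout check. A minimal sketch: the 3-second interval is from the slide, while the 10-minute dead-node timeout is assumed for illustration (it is roughly HDFS's default):

```python
HEARTBEAT_INTERVAL = 3       # seconds between heartbeats, per the slide
DEAD_NODE_TIMEOUT = 10 * 60  # seconds of silence before a node is declared dead

last_heartbeat = {}          # DataNode name -> time of last heartbeat

def receive_heartbeat(node, now):
    last_heartbeat[node] = now

def dead_nodes(now):
    """Nodes whose last heartbeat is older than the timeout."""
    return [n for n, t in last_heartbeat.items()
            if now - t > DEAD_NODE_TIMEOUT]

receive_heartbeat("dn1", now=0)
receive_heartbeat("dn2", now=597)
print(dead_nodes(now=601))   # dn1 has been silent for over 10 minutes
```

Once a node lands in the dead list, its blocks are under-replicated and the replication engine schedules new copies.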
Replication Engine
• NameNode detects DataNode failures
– Chooses new DataNodes for new replicas
– Balances disk usage
– Balances communication traffic to DataNodes
HDFS 2nd Generation
• NameNode high availability:
– Active NameNode
– Passive (standby) NameNode
Data Pipelining
• Client retrieves a list of DataNodes on which to place replicas of a block
• Client writes block to the first DataNode
• The first DataNode forwards the data to the next DataNode in the Pipeline
• When all replicas are written, the Client moves on to write the next block in file
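The steps above can be sketched in a few lines. This toy model (not Hadoop code; node and block names are made up) shows the key property: the client sends each block once, and the pipeline fans it out:

```python
def write_block(block, pipeline, storage):
    """Client sends `block` to pipeline[0]; each node stores a copy
    and forwards it to the next node (modeled here as a simple loop)."""
    for node in pipeline:
        storage.setdefault(node, []).append(block)
    return len(pipeline)  # number of replicas written

storage = {}
for block in ["blk_1", "blk_2"]:       # client moves on block by block
    write_block(block, ["dn-a", "dn-b", "dn-c"], storage)
print(storage)
```

Because forwarding happens inside the cluster, the client's uplink carries each block only once regardless of the replication factor.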
Secondary NameNode
• Copies FsImage and Transaction Log from NameNode to a temporary directory
• Merges FSImage and Transaction Log into a new FSImage in temporary directory
• Uploads new FSImage to the NameNode
– Transaction Log on NameNode is purged
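The checkpoint is essentially "replay the log over the image." A pure-Python sketch of that merge (the FsImage is modeled as a dict and the log as a list of operations; the real on-disk formats differ):

```python
def checkpoint(image, log):
    """Replay the transaction log over the FsImage, returning the
    new FsImage and an empty (purged) log."""
    new_image = dict(image)
    for op, path in log:
        if op == "create":
            new_image[path] = "file"
        elif op == "delete":
            new_image.pop(path, None)
    return new_image, []

fsimage = {"/a": "file", "/b": "file"}
edit_log = [("create", "/c"), ("delete", "/a")]
fsimage, edit_log = checkpoint(fsimage, edit_log)
print(sorted(fsimage))  # ['/b', '/c']
```

Doing this merge on a separate machine keeps the NameNode's log short without pausing it, so a restart only has to replay a small log on top of a recent image.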
User Interface
• Command for HDFS User:
– hadoop dfs -mkdir /foodir
– hadoop dfs -cat /foodir/myfile.txt
– hadoop dfs -rm /foodir/myfile.txt
• Command for HDFS Administrator
– hadoop dfsadmin -report
– hadoop dfsadmin -decommission datanodename
• Web Interface
– http://host:port/dfshealth.jsp
Hadoop Map/Reduce
• The Map-Reduce programming model
– Framework for distributed processing of large data sets
– Pluggable user code runs in a generic framework
• Common design pattern in data processing:
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
• Natural for:
– Log processing
– Web search indexing
– Ad-hoc queries
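The shell-pipeline analogy above can be mirrored in a few lines: a minimal in-process word count where sorting and grouping play the role of the shuffle (an illustration of the model, not the Hadoop API):

```python
from itertools import groupby

def map_phase(lines):
    """map: emit a (word, 1) pair for every word."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(pairs):
    """shuffle: sort by key and group, like `sort` before `uniq -c`."""
    return groupby(sorted(pairs), key=lambda kv: kv[0])

def reduce_phase(grouped):
    """reduce: sum the counts for each key."""
    return {key: sum(v for _, v in vals) for key, vals in grouped}

lines = ["big data", "big files big blocks"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 3, 'blocks': 1, 'data': 1, 'files': 1}
```

In Hadoop the map and reduce functions are the pluggable user code; the framework supplies the distributed shuffle, scheduling, and fault tolerance.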
References :
1. YouTube lecture videos on the channel 'Training on Big Data and Hadoop' by user 'Edureka'.
2. 'White Book of Big Data' by Fujitsu.
3. 'Big Data For Dummies' (A Wiley Brand).
4. Research paper by Kalapriya Kannan, IBM Research Labs.
5. Collected data from slideshare.net.
Useful Links
• HDFS Design:
– http://hadoop.apache.org/core/docs/current/hdfs_design.html
• Hadoop Information
– http://hadoop.apache.org/
• Hadoop API:
– http://hadoop.apache.org/core/docs/current/api/
Thank you. I appreciate your patience.