hadoop and hdfs presented by vijay pratap singh

Hadoop Distributed File System (HDFS) SEMINAR GUIDE Mr. PRAMOD PAVITHRAN HEAD OF DIVISION COMPUTER SCIENCE & ENGINEERING SCHOOL OF ENGINEERING, CUSAT PRESENTED BY VIJAY PRATAP SINGH REG NO: 12110083 S7, CS-B ROLL NO: 81

Upload: thevijayps

Post on 06-May-2015



TRANSCRIPT

Page 1: HADOOP AND HDFS presented by Vijay Pratap Singh

Hadoop Distributed File System

(HDFS)

SEMINAR GUIDE
Mr. PRAMOD PAVITHRAN
HEAD OF DIVISION
COMPUTER SCIENCE & ENGINEERING
SCHOOL OF ENGINEERING, CUSAT

PRESENTED BY
VIJAY PRATAP SINGH
REG NO: 12110083
S7, CS-B
ROLL NO: 81

CONTENTS

WHAT IS HADOOP

PROJECT COMPONENTS IN HADOOP

MAP/REDUCE

HDFS

ARCHITECTURE

GOALS OF HADOOP

COMPARISON WITH OTHER SYSTEMS

CONCLUSION

REFERENCES

WHAT IS HADOOP…???

o Hadoop is an open-source software framework.

o The Hadoop framework consists of two main layers:
  o Distributed file system (HDFS)
  o Execution engine (MapReduce)

o Supports data-intensive distributed applications.

o Licensed under the Apache v2 license.

o It enables applications to work with thousands of computationally independent computers and petabytes of data.


WHY HADOOP…???


PROJECT COMPONENTS IN HADOOP

MAP/REDUCE

o Hadoop is a popular open-source implementation of map/reduce.

o MapReduce is a programming model for processing large data sets.

o MapReduce is typically used to do distributed computing on clusters of computers.

o MapReduce can take advantage of data locality, processing data on or near the storage assets to decrease transmission of data.

o The model is inspired by the map and reduce functions commonly used in functional programming.

o "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to slave nodes. Each slave node processes its smaller problem and passes the answer back to its master node.

o "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the final output.
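As a rough illustration of the "Map" and "Reduce" steps, here is a minimal single-process sketch in plain Python. It does not use Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen only for illustration, and each input chunk stands in for a sub-problem that would be handed to a separate node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: turn one input chunk into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk plays the role of a sub-problem given to one node.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data on hadoop"]))
```

In a real cluster the map calls would run in parallel on the nodes that hold the data, and only the grouped intermediate pairs would cross the network.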

HDFS

Highly scalable file system
◦ 6K nodes and 120 PB
◦ Add commodity servers and disks to scale storage and I/O bandwidth

Supports parallel reading & processing of data
◦ Optimized for streaming reads/writes of large files
◦ Bandwidth scales linearly with the number of nodes and disks

Fault tolerant & easy management
◦ Built-in redundancy
◦ Tolerates disk and node failure
◦ Automatically manages addition/removal of nodes
◦ One operator per 3K nodes

Scalable, Reliable & Manageable


ISSUES IN CURRENT SYSTEM


BIG DATA


INCREASING BIG DATA

HADOOP’S APPROACH

[Diagram: Big Data is split across several parallel Computation steps, whose outputs are merged into a Combined Result.]


ARCHITECTURE OF HADOOP


HADOOP MASTER/SLAVE ARCHITECTURE


MAP REDUCE ENGINE



ARCHITECTURE OF HDFS



CLIENT INTERACTION TO HADOOP

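HDFS WRITE

[Diagram: a Client, the NameNode, a core switch and two rack switches, with DataNodes 1, 7 and 9 spread over two racks (drawn as Rack 1 and Rack 5). File.txt is split into blocks A, B and C. Rack awareness table on the NameNode: Rack 1: DN 1; Rack 2: DN 7, 9. Messages shown: Client to NameNode "I want to write file.txt, block A"; NameNode to Client "OK, write to DataNodes [1, 7, 9]"; readiness checks passed along the DataNodes ("Ready DN 7 + 9", "Ready 9", "Ready!") before block A is sent.]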
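The NameNode's answer in the HDFS WRITE slide ("write to DataNodes [1, 7, 9]") can be sketched as a rack-aware placement choice. The code below is a toy model, not Hadoop code: `RACKS` and `choose_targets` are hypothetical names, and the policy (one replica on the writer's rack, the remaining replicas together on one other rack) is a simplified assumption that happens to reproduce the slide's example.

```python
# Toy rack map taken from the slide: Rack 1 holds DN 1, Rack 2 holds DN 7 and DN 9.
RACKS = {"rack1": [1], "rack2": [7, 9]}


def choose_targets(racks, local_rack, replication=3):
    """Pick DataNodes for one block (hypothetical, simplified policy).

    One replica goes to a node on the writer's rack; the remaining
    replicas go to nodes on a single different rack, so that losing
    one whole rack never loses every copy of the block.
    """
    if not racks.get(local_rack):
        raise ValueError("local rack has no DataNodes")
    targets = [racks[local_rack][0]]

    remote_racks = [r for r in racks if r != local_rack]
    if not remote_racks:
        raise ValueError("need at least two racks")
    # Prefer the remote rack with the most nodes so the other replicas fit.
    remote = max(remote_racks, key=lambda r: len(racks[r]))
    targets.extend(racks[remote][: replication - 1])

    if len(targets) < replication:
        raise ValueError("not enough DataNodes for the requested replication")
    return targets


if __name__ == "__main__":
    print(choose_targets(RACKS, "rack1"))
```

Spreading replicas across racks trades some write bandwidth (the block crosses the core switch once) for protection against a rack-level failure.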

HDFS WRITE (PIPELINED)

[Diagram: same cluster layout as the HDFS WRITE slide. Block A is passed from the Client through DataNodes 1, 7 and 9 in a pipeline, so each of the three holds a copy. Messages shown: "Block Received" and "Success". NameNode metadata: File.txt = Blk A: DN 1, 7, 9.]
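The pipelined write can be sketched as a chain in which each DataNode stores the block, forwards it to the next node, and acknowledges only after everything downstream has acknowledged. This is a simplified simulation under stated assumptions: `pipelined_write`, `write_file_block`, and the `storage` and `metadata` dictionaries are hypothetical names, and the real protocol streams packets and reports received blocks to the NameNode rather than making one recursive call.

```python
def pipelined_write(block, pipeline, storage):
    """Forward `block` through the DataNodes in `pipeline` order.

    Returns the DataNode ids in the order their acks reach the client,
    which is the reverse of the pipeline order.
    """
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(block)     # store the replica locally
    acks = pipelined_write(block, rest, storage)   # forward downstream
    acks.append(head)                              # ack after downstream acks
    return acks


def write_file_block(name, block, pipeline, metadata, storage):
    """Write one block and record its replica locations on success."""
    acks = pipelined_write(block, pipeline, storage)
    if sorted(acks) == sorted(pipeline):
        # Toy NameNode metadata: file name -> block id -> DataNodes.
        metadata.setdefault(name, {})[block] = list(pipeline)
        return "Success"
    return "Failed"


if __name__ == "__main__":
    metadata, storage = {}, {}
    print(write_file_block("File.txt", "A", [1, 7, 9], metadata, storage))
    print(metadata)
```

Because the client sends the block only once and the DataNodes relay it, the client's outgoing bandwidth does not grow with the replication factor.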

HDFS READ

[Diagram: same cluster layout as the HDFS WRITE slide, with a copy of block A on DataNodes 1, 7 and 9. Messages shown: Client to NameNode "I want to read file.txt, block A"; NameNode to Client "Available at nodes [1, 7, 9]". The Client then reads block A from one of those DataNodes.]
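The read path in the HDFS READ slide can be sketched as a lookup followed by a choice of the nearest replica. The names below (`RACK_OF`, `distance`, `pick_replica`, `read_block`) are hypothetical, and the two-level distance (same rack or not) is a deliberately crude assumption standing in for real network topology.

```python
# Rack of each DataNode, following the slide's rack awareness table.
RACK_OF = {1: "rack1", 7: "rack2", 9: "rack2"}


def distance(client_rack, node):
    """Toy metric: 0 if the DataNode shares the client's rack, else 1."""
    return 0 if RACK_OF[node] == client_rack else 1


def pick_replica(locations, client_rack):
    """Return the replica location closest to the client."""
    return min(locations, key=lambda node: distance(client_rack, node))


def read_block(block_map, storage, name, block, client_rack):
    """Ask the (toy) NameNode for locations, then read from the nearest one."""
    locations = block_map[name][block]   # e.g. "Available at nodes [1, 7, 9]"
    node = pick_replica(locations, client_rack)
    return node, storage[node][block]


if __name__ == "__main__":
    block_map = {"file.txt": {"A": [1, 7, 9]}}
    storage = {n: {"A": b"block A bytes"} for n in (1, 7, 9)}
    print(read_block(block_map, storage, "file.txt", "A", "rack2"))
```

The NameNode only hands out locations; the block data itself flows directly between the client and the chosen DataNode.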

GOALS OF HDFS

Very Large Distributed File System
◦ 10K nodes, 100 million files, 10 PB

Assumes Commodity Hardware
◦ Files are replicated to handle hardware failure
◦ Detect failures and recover from them

Optimized for Batch Processing
◦ Data locations exposed so that computations can move to where data resides
◦ Provides very high aggregate bandwidth
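The "detect failures and recover from them" goal can be illustrated with a small sketch of heartbeat-based failure detection followed by re-replication. Everything here is an assumption made for illustration: the function names, the timeout value, and the rule of copying to the lowest-numbered live node are not taken from the slides or from Hadoop's implementation.

```python
def find_dead_nodes(last_heartbeat, now, timeout):
    """Return DataNodes whose last heartbeat is older than `timeout` seconds."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def re_replicate(block_map, live_nodes, dead_nodes, replication=3):
    """Drop replicas on dead nodes, then restore the replication factor.

    `block_map` maps a block id to the list of DataNodes holding it and
    is updated in place.
    """
    for holders in block_map.values():
        holders[:] = [n for n in holders if n not in dead_nodes]
        candidates = sorted(n for n in live_nodes if n not in holders)
        while len(holders) < replication and candidates:
            holders.append(candidates.pop(0))   # copy the block to a live node
    return block_map


if __name__ == "__main__":
    last_heartbeat = {1: 100, 7: 40, 9: 98, 12: 99}   # seconds, hypothetical
    dead = find_dead_nodes(last_heartbeat, now=100, timeout=30)
    live = set(last_heartbeat) - dead
    print(dead, re_replicate({"A": [1, 7, 9]}, live, dead))
```

With replication handled this way, the loss of a single commodity machine reduces redundancy only temporarily instead of losing data.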


SCALABILITY OF HADOOP


EASE OF USE FOR PROGRAMMERS


HADOOP VS. OTHER SYSTEMS


HADOOP USERS

TO LEARN MORE

Source code
◦ http://hadoop.apache.org/version_control.html
◦ http://svn.apache.org/viewvc/hadoop/common/trunk/

Hadoop releases
◦ http://hadoop.apache.org/releases.html

Contribute to it
◦ http://wiki.apache.org/hadoop/HowToContribute

CONCLUSION

HDFS provides a reliable, scalable and manageable solution for working with huge amounts of data

Future secure

HDFS has been deployed in clusters of 10 to 4K DataNodes
◦ Used in production at companies such as Yahoo!, Facebook, Twitter and eBay
◦ Many enterprises, including financial companies, use Hadoop

REFERENCES

[1] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS. In VLDB ’07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 23–34, 2007.

[2] Tom White. Hadoop: The Definitive Guide. O’Reilly Media, Third Edition, May 2012.

[3] Jeffrey Shafer, Scott Rixner, and Alan L. Cox. The Hadoop Distributed Filesystem: Balancing Portability and Performance. Rice University, Houston, TX.

[4] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. Yahoo!, Sunnyvale, California, USA.

[5] Jens Dittrich and Jorge-Arnulfo Quiané-Ruiz. Efficient Big Data Processing in Hadoop MapReduce. Information Systems Group, Saarland University.

Thank you…

Queries