Hadoop Distributed File System · 2019-02-07


Hadoop Distributed File System

By Mr. D. B. Shanmugam

Associate Professor & HOD

Sri Balaji Chockalingam Engineering College, Arni


Overview

Distributed File System

History of HDFS

What is HDFS

HDFS Architecture

File commands

Demonstration


Distributed File System

Hold a large amount of data

Clients distributed across a network

Network File System (NFS)

o Straightforward design

o Remote access to a single machine

o Constraints


History

2002 – Apache Nutch, an open-source web search engine

Scaling issues

2003 – Publication of the GFS paper, which addressed Nutch's scaling issues

2004 – Nutch Distributed File System (NDFS)

2006 – Apache Hadoop: MapReduce and HDFS


HDFS

Terabytes or Petabytes of data

Larger files than NFS

Reliable

Fast, Scalable access

Integrates well with MapReduce

Restricted to a class of applications


HDFS versus NFS

Network File System (NFS)

Single machine makes part of its file system available to other machines

Sequential or random access

PRO: Simplicity, generality, transparency

CON: Storage capacity and throughput limited by single server

Hadoop Distributed File System (HDFS)

Single virtual file system spread over many machines

Optimized for sequential reads and local accesses

PRO: High throughput, high capacity

"CON": Specialized for particular types of applications


HDFS


Basics

Distributed File System of Hadoop

Runs on commodity hardware

Stream data at high bandwidth

Challenge: tolerate node failure without data loss

Simple Coherency model

Computation is near the data

Portability – built using Java


Basics

Interface patterned after the UNIX file system

File system metadata and application data are stored separately

Metadata is stored on a dedicated server called the Namenode

Application data is stored on Datanodes


Basics

HDFS is good for

◦ Very large files

◦ Streaming data access

◦ Commodity hardware


Basics

HDFS is not good for

◦ Low-latency data access

◦ Lots of small files

◦ Multiple writers, arbitrary file modifications


Differences from GFS

Only a single writer per file

Open Source


HDFS Architecture


HDFS Concepts

Namespace

Blocks

Namenodes and Datanodes

Secondary Namenode


HDFS Namespace

Hierarchy of files and directories

Kept in RAM

Represented on the Namenode by inodes

Attributes: permissions, modification and access times, namespace and disk space quotas


Blocks

HDFS blocks are large: 64 MB or 128 MB by default, depending on the Hadoop version (configurable)

Large blocks minimize the cost of seeks

Benefit: the blocks of a file can take advantage of any disks in the cluster

Simplifies the storage subsystem: the amount of metadata stored per file is reduced

Fits well with replication
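
As a rough illustration of how a file maps onto fixed-size blocks, the following minimal sketch computes the block lengths for a given file size. The function name `split_into_blocks` and the 128 MB block size are assumptions chosen for the example, not part of the slides or of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MB)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block of a file of file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    lengths = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes, not a full block.
        lengths.append(remainder)
    return lengths


one_gb = 1024 ** 3
blocks = split_into_blocks(one_gb + 1)
print(len(blocks), "blocks; last block holds", blocks[-1], "byte(s)")
```

Fewer, larger blocks mean fewer entries for the Namenode to track per file, which is the metadata saving the slide refers to.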


Namenodes and Datanodes

Master-worker pattern

Single Namenode: the master server

A number of Datanodes: usually one per node in the cluster


Namenode

Master

Manages filesystem namespace

Maintains the filesystem tree and metadata persistently in two files: the namespace image and the edit log

Stores the locations of blocks, but not persistently

Metadata: inode data and the list of blocks of each file

Datanodes

Workhorses of the filesystem

Store and retrieve blocks

Send block reports to the Namenode

Do not use data protection mechanisms like RAID; rely on replication instead


Datanodes

Each block replica is stored as two files: one for the data, the other for the block's metadata, including checksums and the generation stamp

Size of the data file equals the actual length of the block


Datanodes

Startup handshake:

o Namespace ID

o Software version


Datanodes

After handshake:

o Registration

o Storage ID

o Block Report

o Heartbeats
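
The handshake and heartbeat steps listed above can be sketched as a small simulation. This is a simplified model, not Hadoop code: the names `NodeInfo`, `handshake_ok` and `HeartbeatMonitor` are hypothetical, the handshake is assumed to require an exact match of both fields, and the timeout value is an illustrative parameter.

```python
from dataclasses import dataclass


@dataclass
class NodeInfo:
    namespace_id: int
    software_version: str


def handshake_ok(namenode: NodeInfo, datanode: NodeInfo) -> bool:
    """Assumed rule: both the namespace ID and the software version must match."""
    return (namenode.namespace_id == datanode.namespace_id
            and namenode.software_version == datanode.software_version)


class HeartbeatMonitor:
    """Tracks the last heartbeat time per storage ID (hypothetical helper)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}

    def heartbeat(self, storage_id, now):
        self.last_seen[storage_id] = now

    def dead_nodes(self, now):
        # A node is treated as dead once it has been silent longer than the timeout.
        return sorted(sid for sid, seen in self.last_seen.items()
                      if now - seen > self.timeout_seconds)


monitor = HeartbeatMonitor(timeout_seconds=600)  # illustrative timeout
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=500)
print(monitor.dead_nodes(now=700))
```

Times are passed in explicitly rather than read from the clock so that the behaviour is easy to follow and to test.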


Secondary Namenode

If the Namenode fails, the filesystem cannot be used

Two ways to make it resilient to failure:

o Back up the files that hold the filesystem metadata

o Secondary Namenode


Secondary Namenode

Periodically merges the namespace image with the edit log

Runs on a separate physical machine

Has a copy of the metadata, which can be used to reconstruct the state of the Namenode

Disadvantage: its state lags that of the primary Namenode

Renamed as CheckpointNode (CN) in the 0.21 release[1]

Checkpointing is periodic, not continuous

If the Namenode dies, it does not take over the responsibilities of the Namenode


HDFS Client

Code library that exports the HDFS file system interface

Allows user applications to access the file system


File I/O Operations


Write Operation

Once written, data cannot be altered; it can only be appended to

The HDFS client is granted a lease for the file

Renewal of lease

Lease – soft limit, hard limit

Single-writer multiple-reader model
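
The soft and hard lease limits can be illustrated with a small state model. This is a sketch under assumptions: the slides do not spell out the semantics, so the code assumes that before the soft limit the writer has exclusive access, between the soft and hard limits another client may take over the lease, and after the hard limit the lease is considered expired. The class `Lease` and both limit values are hypothetical.

```python
SOFT_LIMIT = 60        # seconds; illustrative value
HARD_LIMIT = 60 * 60   # seconds; illustrative value


class Lease:
    """Write lease held by a single client for one file (hypothetical model)."""

    def __init__(self, holder, now):
        self.holder = holder
        self.last_renewed = now

    def renew(self, now):
        # Renewal resets both limits.
        self.last_renewed = now

    def state(self, now):
        age = now - self.last_renewed
        if age < SOFT_LIMIT:
            return "exclusive"      # only the holder may write
        if age < HARD_LIMIT:
            return "preemptible"    # assumed: another client may claim the lease
        return "expired"            # assumed: the system may close the file


lease = Lease("client-1", now=0)
print(lease.state(now=30), lease.state(now=120), lease.state(now=4000))
```

Because only one client can hold the lease at a time while readers need no lease at all, this is what yields the single-writer, multiple-reader model.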


HDFS Write


Write Operation

Block allocation

Hflush operation



Data pipeline during block construction


Creation of new file


Read Operation

Checksums

Verification
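
Checksum verification on read can be sketched as follows: checksums are computed per fixed-size chunk when the data is written, and a reader recomputes them and compares. This is an illustrative model only; the 512-byte chunk size, the use of `zlib.crc32` and the function names are assumptions for the example.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumed chunk size covered by one checksum


def compute_checksums(data, chunk=BYTES_PER_CHECKSUM):
    """Return one CRC32 value per chunk of data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def verify(data, expected, chunk=BYTES_PER_CHECKSUM):
    """Return the indexes of chunks whose checksum does not match."""
    actual = compute_checksums(data, chunk)
    if len(actual) != len(expected):
        raise ValueError("checksum count does not match data length")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


data = bytes(range(256)) * 8            # 2048 bytes, four chunks
stored = compute_checksums(data)        # kept alongside the block when written

corrupted = bytearray(data)
corrupted[600] ^= 0xFF                  # flip one byte inside the second chunk
print(verify(data, stored), verify(bytes(corrupted), stored))
```

A reader that finds a mismatch can report the corrupt replica and fetch the block from another Datanode, which is where replication comes in.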


HDFS Read


Replication

Multiple nodes for reliability

Additionally, data transfer bandwidth is multiplied

Computation is near the data

Replication factor


Image and Journal

State is stored in two files:

fsimage: Snapshot of file system metadata

editlog: Changes since last snapshot

Normal Operation:

When the Namenode starts, it reads fsimage and then applies all the changes from the edit log sequentially
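
The startup sequence can be sketched as a replay of logged operations over a snapshot. This is a toy model: the namespace is reduced to a set of paths, and the operation names (`mkdir`, `create`, `delete`, `rename`) and function names are hypothetical, not the real edit log format. Merging the image with the log, as a checkpoint does, is the same replay followed by writing a new image and starting an empty log.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace (a set of paths)."""
    op, path = edit[0], edit[1]
    if op in ("mkdir", "create"):
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    elif op == "rename":
        namespace.discard(path)
        namespace.add(edit[2])
    else:
        raise ValueError(f"unknown operation: {op}")


def load_namespace(fsimage, editlog):
    """Start from the snapshot, then replay the edits in order."""
    namespace = set(fsimage)
    for edit in editlog:
        apply_edit(namespace, edit)
    return namespace


def checkpoint(fsimage, editlog):
    """Merge image and log into a new image; the log starts again empty."""
    return sorted(load_namespace(fsimage, editlog)), []


fsimage = ["/", "/data"]
editlog = [
    ("mkdir", "/logs"),
    ("create", "/data/a.txt"),
    ("rename", "/data/a.txt", "/data/b.txt"),
    ("delete", "/logs"),
]
new_image, new_log = checkpoint(fsimage, editlog)
print(new_image, new_log)
```

The longer the edit log grows, the longer this replay takes at startup, which is why periodic checkpoints matter.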


Snapshots

Persistently save current state

Datanodes are instructed during the handshake to create a local snapshot


Block Placement

Nodes spread across multiple racks

Nodes of a rack share a switch

Placement of replicas critical for reliability
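
A rack-aware placement rule can be sketched as below. The slides do not state the policy, so this assumes the commonly described default for three replicas: the first replica on the writer's node, the second and third on two different nodes of one other rack. The function `place_replicas` and the topology layout are hypothetical, and the choice is made deterministic (first suitable rack and nodes in sorted order) where a real cluster would choose randomly.

```python
def place_replicas(writer, topology):
    """Pick three nodes for a block.

    topology maps a rack name to the list of node names in that rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    if writer not in rack_of:
        raise ValueError(f"unknown node: {writer}")
    local_rack = rack_of[writer]
    for rack in sorted(topology):
        # The remote rack must be able to host two distinct replicas.
        if rack != local_rack and len(topology[rack]) >= 2:
            second, third = sorted(topology[rack])[:2]
            return [writer, second, third]
    raise ValueError("no remote rack with two nodes available")


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4", "n5"]}
print(place_replicas("n1", topology))
```

With this layout, the loss of a single node or of a whole rack (for example a failed switch) still leaves at least one replica reachable.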


Replication Management

Replication factor

Under-replication

Over-replication
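
Under- and over-replication can be illustrated by comparing each block's live replica count with the replication factor. A minimal sketch, assuming a hypothetical `classify_blocks` helper and the additional assumption that under-replicated blocks with the fewest replicas should be handled first:

```python
def classify_blocks(replica_counts, replication_factor=3):
    """Split blocks into under- and over-replicated lists.

    replica_counts maps a block ID to its number of live replicas.
    """
    under = [b for b, n in replica_counts.items() if n < replication_factor]
    over = [b for b, n in replica_counts.items() if n > replication_factor]
    # Assumed priority: blocks with the fewest replicas are re-replicated first.
    under.sort(key=lambda b: (replica_counts[b], b))
    over.sort()
    return under, over


counts = {"blk_1": 3, "blk_2": 1, "blk_3": 4, "blk_4": 2}
under, over = classify_blocks(counts)
print("re-replicate:", under, "remove a replica from:", over)
```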


Balancer

Balance disk space usage

Optimize by minimizing the inter-rack data copying
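
One way to express "balanced" is to compare each node's disk utilization with the utilization of the cluster as a whole. The sketch below assumes that definition and a 10% threshold; the function `unbalanced_nodes` and its inputs are hypothetical, and it only identifies candidate nodes rather than moving any blocks.

```python
def unbalanced_nodes(used, capacity, threshold=0.10):
    """Return nodes whose utilization differs from the cluster's by more than threshold.

    used and capacity map a node name to bytes (or any consistent unit).
    """
    cluster_util = sum(used.values()) / sum(capacity.values())
    result = {}
    for node in used:
        util = used[node] / capacity[node]
        if abs(util - cluster_util) > threshold:
            # "over" nodes are sources of blocks, "under" nodes are targets.
            result[node] = "over" if util > cluster_util else "under"
    return result


used = {"dn1": 90, "dn2": 50, "dn3": 10}
capacity = {"dn1": 100, "dn2": 100, "dn3": 100}
print(unbalanced_nodes(used, capacity))
```

Pairing an over-utilized node with an under-utilized node in the same rack, when one exists, is what keeps inter-rack copying to a minimum.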


Block Scanner

Periodically scan and verify checksums

Verification succeeded?

Corrupt block?


Decommissioning

Removal of nodes without data loss

Nodes are retired on a schedule that allows their blocks to be replicated elsewhere

No block may have all of its replicas on the nodes being retired


HDFS – What does it choose in CAP?

Partition Tolerance – can handle losing Datanodes

Consistency

Steps towards Availability: Backup Node


Backup Node

NameNode streams the transaction log to the BackupNode

BackupNode applies the log to its in-memory and on-disk image

Always commits to disk before reporting success to the NameNode

If it restarts, it has to catch up with the NameNode

Available in the HDFS 0.21 release

Limitations:

o Maximum of one per NameNode

o NameNode does not forward block reports

o Time to restart from a 2 GB image (20M files + 40M blocks):

3–5 minutes to read the image from disk

30 minutes to process block reports

BackupNode will still take 30 minutes to fail over!


Files in HDFS


File Permissions

Three types:

◦ Read permission (r)

◦ Write permission (w)

◦ Execute Permission (x)

Owner

Group

Mode
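
The owner, group and mode fields combine much as they do in a UNIX-style permission check. The following sketch assumes that model: the mode is a 9-bit value with three bits each for owner, group and others. The helpers `format_mode` and `may_access` and the user names are hypothetical.

```python
def format_mode(mode):
    """Render a 9-bit mode such as 0o750 as a string like 'rwxr-x---'."""
    flags = "rwxrwxrwx"
    return "".join(flag if mode & (1 << (8 - i)) else "-"
                   for i, flag in enumerate(flags))


def may_access(mode, owner, group, user, user_groups, want):
    """Check one permission; want is 'r', 'w' or 'x'."""
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        shift = 6   # owner bits
    elif group in user_groups:
        shift = 3   # group bits
    else:
        shift = 0   # bits for everyone else
    return bool((mode >> shift) & bit)


mode = 0o750
print(format_mode(mode))
print(may_access(mode, "alice", "staff", "bob", ["staff"], "r"))
```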


Command Line Interface


hadoop fs -help

hadoop fs -ls : Lists a directory

hadoop fs -mkdir : Makes a directory in HDFS

hadoop fs -copyFromLocal : Copies data to HDFS from the local filesystem

hadoop fs -copyToLocal : Copies data from HDFS to the local filesystem

hadoop fs -rm : Deletes a file in HDFS

More:

https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
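
When several of these commands are run in sequence, it can help to build them in a script. The sketch below only constructs and prints the command lines; it does not contact a cluster. The helper `hadoop_fs`, the file name `report.txt` and the `/user/demo` paths are hypothetical examples.

```python
import shlex


def hadoop_fs(command, *args):
    """Build the argument list for one `hadoop fs` invocation."""
    return ["hadoop", "fs", f"-{command}", *args]


commands = [
    hadoop_fs("mkdir", "/user/demo"),
    hadoop_fs("copyFromLocal", "report.txt", "/user/demo/report.txt"),
    hadoop_fs("ls", "/user/demo"),
]

for argv in commands:
    # Print the shell form of each command instead of running it.
    print(shlex.join(argv))
```

On a machine with Hadoop installed, each list could be passed to `subprocess.run(argv, check=True)` to execute it.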


Accessing HDFS directly from Java

Programs can read or write HDFS files directly

Files are represented as URIs

Access is via the FileSystem API

o To get access to the file system: FileSystem.get()

o For reading, call open(), which returns an FSDataInputStream (an InputStream)

o For writing, call create(), which returns an FSDataOutputStream (an OutputStream)


Interfaces

Getting data in and out of HDFS through the command-line interface is a bit cumbersome

Alternatives:

FUSE file system: Allows HDFS to be mounted under Unix

WebDAV share: Can be mounted as a filesystem on many OSes

HTTP: Read access through the Namenode's embedded web server

FTP: Standard FTP interface


Demonstration


Questions?


Thank you
