implementation of hadoop distributed file system protocol on … · 2019-12-21 · implementation...

25
Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage Division

Upload: others

Post on 18-Mar-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

Implementation of Hadoop Distributed File System Protocol on OneFS

Tanuj Khurana EMC Isilon Storage Division

Page 2: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

Outline

HDFS Overview OneFS Overview HDFS protocol on OneFS HDFS protocol server implementation References Q&A

2

Page 3: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS Overview

Distributed File System Inspired by Google’s GFS Designed for scalability and fault tolerance Fast streaming data access Minimal data motion

Master Slave Architecture NameNode (Master) DataNodes http://hadoop.apache.org/docs/r1.2.1/hdfs_design 3

Page 4: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS Overview: NameNode

Manages the file-system namespace Stores all metadata in the RAM File names, owners, group, access info Maintains file to blocks mapping Manages block replication

4

Page 5: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS Overview: DataNode

Stores blocks of files on top of native host OS file-system (e.g. EXT3, ZFS)

Same block is replicated on multiple data nodes for redundancy (typically 3X)

Has no “awareness” of data blocks living elsewhere (only the NameNode does)

5

Page 6: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS Overview: Workflow

6

HDFS file

file copy2 file copy3

node info

file

node info file copy2 file copy3

file

node info

file copy2 file copy3

file

node info file copy2 file copy3

Node reply Node reply Node reply Node reply node reply

MAP Reduce

MAP Reduce

MAP Reduce

MAP Reduce

MAP Reduce

node info

MAP Reduce

MAP Reduce

MAP Reduce

MAP Reduce

Data

Compute

MAP Reduce

MAP Reduce

MAP Reduce

MAP Reduce

MAP Reduce

MAP Reduce

MAP Reduce

MAP Reduce

MAP Reduce

Compute

Data

Name node

3X

NFS

Name node

Decision Support Databases

Web Click data

OLAP

EDW

HTTP

CIFS

FTP

NFS

Landing Zone Servers

Step 1: Data is copied into the

Landing Zone

Step 2: Data is copied into the

Cluster (3 times)

Step 3: Hadoop Jobs are run

Page 7: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

OneFS Overview

Built from the ground up on FreeBSD Distributed scale-out file system Posix compliant Built in support for Data Protection,

Snapshots, DR, Audit, Deduplication Support for multiple protocols SMB, NFS, HTTP, SWIFT, HDFS

7

Page 8: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

OneFS Overview: Semantics

Symmetric cluster architecture Metadata distributed across all nodes

Globally coherent file system access Distributed lock manager Two-phase commit for all write operations

Reed-Solomon FEC used for data protection

8

Page 9: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

OneFS Overview: Architecture

9

Isilon IQ Storage Layer

Intracluster Communication

Infiniband

Servers

Client/Application Layer Ethernet Layer

Servers

Servers

Page 10: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS protocol on OneFS

Implements the HDFS interface for Client-NameNode and Client-DataNode

Each Isilon node runs a NameNode and DataNode service

Underlying file system is OneFS

10

Page 11: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS protocol on OneFS: Architecture

11

Ethernet

R (RHIPE)

PIG

Mahout Hive HBase

Job Tracker ZooKeeper DataNode

Compute Node Compute Node Compute Node

Compute Node Compute Node Compute Node

NameNode

name node

name node

name node

name node data node

Page 12: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS protocol on OneFS: Benefits Multi-protocol access No data ingestion, faster time to results Single repository for all data

Scale compute and data independently Higher storage efficiency (OneFS: 80% usable) Active-Active NameNode architecture Simultaneous multi-distribution and multi-

Hadoop version support More data management options (Snapshots,

DR, Audit etc …)

12

Page 13: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS Workflow on OneFS

13

SMB, NFS, HTTP, FTP,

HDFS

node info

node info

node info

node info

MAP Reduce

MAP Reduce

MAP Reduce

MAP Reduce

name node

data node

Isilon

name node

name node

name node

NFS

Decision Support Databases

Web Click data

OLAP

EDW

Step 1: Much or all of the Data lives on

the Isilon/Hadoop Cluster

Step 2: Jobs are run

Hadoop Cluster

Page 14: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS Protocol Impl: NameNode Most RPCs translate to POSIX system calls setPermission() → chmod(…) setTimes() → utimes(…) create() → open(…, O_CREAT, …)

Other RPCs need creative interpretation getBlockLocations(), addBlock(), abandonBlock() renewLease(), recoverLease()

Implements multiple versions of the protocol V1, V2 and V2.2 Versions have different wire formats

14

Page 15: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

NameNode Connection Routing NameNode is configured as single URL Easy configuration: Set fs.defaultFS to hdfs://smartconnect.isilon.com:8020/

DNS round-robin to distribute across nodes Metadata IOPs get spread out OneFS maintains cross-node consistency

IP Failover plus client retries for resiliency

15

Page 16: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS protocol Impl: Data Path

16

OneFS Clustered FileSystem

OneFS Node

NameNode

DataNode

OneFS Node

NameNode

DataNode

OneFS Node

NameNode

DataNode

OneFS Node

NameNode

DataNode

Hadoop Node

DFSClient 1) Read/Write Request GetBlockLocations(“/file”) AddBlock(“/file”)

2) Response Read: [ Blk1(locs[3]), Blk2(…), ] Write: [ Blk1(locs[1]), Blk2(…), ]

3) ReadBlock(block) / WriteBlock(block)

4) Data (Zero-Copy)

Page 17: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS protocol Impl: Data Path

17

Specific to OneFS

Page 18: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS protocol impl: Rack locality

Configure racks to limit cross switch contention

18

Isilon InfiniBand Switch

Rack Ethernet Switch

Compute

Shuffle

SATA

1+ Gbps

10 Gbps

Core Ethernet Switch

Compute

Shuffle

10 Gbps

… …

IB

Rack Ethernet Switch

Compute

Shuffle

SATA

10 Gbps

Compute

Shuffle

10 Gbps

… …

IB

(used only for copy phase) 1+ Gbps (used only for copy phase)

Isilon HDFS

Isilon HDFS

Isilon HDFS

Isilon HDFS

HDFS I/O ALWAYS comes through a rack-local Isilon node which collects data blocks from all other Isilon nodes across the InfiniBand fabric

Page 19: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS protocol Impl: Authentication Simple Authentication Username sent in clear-text on wire, requires

name resolution on every access Integrated with different directory services

(AD, LDAP, NIS) Kerberos Authentication One hdfs service SPN for the cluster Kerberos “provider” manages the keytab and

SPNs for both MIT/AD KDC Impersonation supported via “proxyusers”

19

Page 20: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS protocol Impl: Leases HDFS implements a single-writer, multiple-

reader model Only one client can hold a lease on a file opened

for writing, other clients can still read Clients periodically renew lease by sending

requests to NameNode Leases “expire” On OneFS, leases are cluster aware because of

distributed NameNode architecture Built on top of OneFS Distributed Lock

Manager 20

Page 21: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS protocol Impl: WebHDFS RESTful API to access HDFS Popular for scripting, toolkits and integration Used by Apache Hue, a popular HDFS file

browser client Runs within the hdfs daemon Communicates with Apache web server over

a unix domain socket using the FastCGI interface

Supports both HTTP/HTTPS Supports SPNEGO via Kerberos

21

Page 22: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

HDFS protocol Impl: Access Zones OneFS solution to Multi-Tenancy that ties

together: Cluster network configuration ( IP Pools) Authentication providers File protocol access

Zone context determined based on the cluster IP address the client connects to

Logically partition cluster into self-contained units

22

Page 23: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

Access Zones + HDFS Per-zone HDFS root directory Limits the file-system namespace view Virtualize all file path accesses (e.g.

/home/user1 -> /ifs/zone1/home/user1) Per-zone HDFS security settings Simple_only / Kerberos_only / All

Per-zone authentication services (AD, LDAP…) Key enabler for HDFS as a Service solution

23

Page 24: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

References EMC Isilon OneFS Overview

http://www.emc.com/collateral/hardware/white-papers/h10719-isilon-onefs-technical-overview-wp.pdf

EMC Isilon Hadoop White Paper http://www.emc.com/collateral/software/white-papers/h10528-wp-hadoop-on-isilon.pdf

Isilon Hadoop Best Practices http://www.emc.com/collateral/white-paper/h12877-wp-emc-isilon-hadoop-best-practices.pdf

EMC Hadoop Starter Kit https://community.emc.com/docs/DOC-26892

24

Page 25: Implementation of Hadoop Distributed File System Protocol on … · 2019-12-21 · Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage

2014 Storage Developer Conference. © EMC CORPORATION. All Rights Reserved.

Questions ?

25