samfineberg big data hadoop storage options 3v9 - snia · pdf filebig data storage options for...

44
PRESENTATION TITLE GOES HERE Big Data Storage Options for Hadoop Sam Fineberg/HP Storage Division

Upload: buituong

Post on 17-Mar-2018

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

PRESENTATION TITLE GOES HEREBig Data Storage Options for Hadoop

Sam Fineberg/HP Storage Division

Page 2: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

SNIA Legal Notice

The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations and literature under the following conditions:

Any slide or slides used must be reproduced in their entirety without modificationThe SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.

This presentation is a project of the SNIA Education Committee.Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney.The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information.

NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.

2

Page 3: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Abstract

Big Data Storage Options for HadoopThe Hadoop system was developed to enable the transformation and analysis of vast amounts of structured and unstructured information. It does this by implementing an algorithm called MapReduce across compute clusters that may consist of hundreds or even thousands of nodes. In this presentation Hadoop will be looked at from a storage perspective. The tutorial will describe the key aspects of Hadoop storage, the built-in Hadoop file system (HDFS), and some other options for Hadoop storage that exist in the commercial and open source communities.

3

Page 4: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Overview

IntroductionWhat is HadoopWhat is MapReduceHow does Hadoop use storageDistributed filesystem concepts

Storage optionsNative Hadoop – HDFS

On direct attached storageOn networked (SAN) storage

Alternative distributed filesystemsCloud object storageEmerging options

4

Page 5: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Overview

IntroductionWhat is HadoopWhat is MapReduceHow does Hadoop use storageDistributed filesystem concepts

Storage optionsNative Hadoop – HDFS

On direct attached storageOn networked (SAN) storage

Alternative distributed filesystemsCloud object storageEmerging options

5

Page 6: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

What is Hadoop?

A scalable fault-tolerant distributed system for data storage and processingCore Hadoop has two main components

MapReduce: fault-tolerant distributed processingProgramming model for processing sets of dataMapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s)

Hadoop Distributed File System (HDFS): high-bandwidth clustered storage

Distributed file system optimized for large filesOperates on unstructured and structured dataA large and active ecosystemWritten in JavaOpen source under the friendly Apache License

http://hadoop.apache.org

6

Page 7: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

What is MapReduce?

A method for distributing a task across multiple nodesEach node processes data stored on that nodeConsists of two developer-created phases1. Map2. ReduceIn between Map and Reduce is the Shuffle and Sort

7

Page 8: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

MapReduce

8

© Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html

Page 9: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

What was the max temperature for the last century?

MapReduce Operation

9

Page 10: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Key MapReduce Terminology Concepts

A user runs a client program (typically a Java application) on a client computerThe client program submits a job to HadoopThe job is sent to the JobTracker process on the Master NodeEach Slave Node runs a process called the TaskTrackerThe JobTracker instructs TaskTrackers to run and monitor tasksA task attempt is an instance of a task running on a slave nodeThere will be at least as many task attempts as there are tasks which need to be performed

10

Page 11: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

MapReduce in Hadoop

11

© Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html

Task Tracker

Input (HDFS)

Output (HDFS)

Mapper

Reducer

Worker=Tasks

Page 12: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

MapReduce: Basic Concepts

Each Mapper processes single input split from HDFSHadoop passes one record at a time to the developer’s Map code Each record has a key and a valueIntermediate data written by the Mapper to local disk (not HDFS) on each of the individual cluster nodes

intermediate data is reliable or globally accessible

During shuffle and sort phase, all values associated with same intermediate key are transferred to same ReducerReducer is passed each key and a list of all its valuesOutput from Reducers is written to HDFS

12

Page 13: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

What is a Distributed File System?

A distributed file system is a file system that allows access to files from multiple hosts across a network

A network filesystem (NFS/CIFS) is a type of distributed file system – more tuned for file sharing than distributed computationDistributed computing applications, like Hadoop, utilize a tightly coupled distributed file system

Tightly coupled distributed filesystemsProvide a single global namespace across all nodesSupport multiple initiators, multiple disk nodes, multiple access to files – file parallelismExamples include HDFS, GlusterFS, pNFS, as well as many commercial and research systems

13

Page 14: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Overview

IntroductionWhat is HadoopWhat is MapReduceHow does Hadoop use storageDistributed filesystem concepts

Storage optionsNative Hadoop – HDFS

On direct attached storageOn networked (SAN) storage

Alternative distributed filesystemsCloud object storageEmerging options

14

Page 15: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Hadoop Distributed File System - HDFS

ArchitectureJava application, not deeply integrated with the server OSLayered on top of a standard FS (e.g., ext, xfs, etc.)Must use Hadoop or a special library to access HDFS filesShared-nothing, all nodes have direct attached disksWrite once filesystem – must copy a file to modify it

HDFS basicsData is organized into files & directoriesFiles are divided into 64-128MB blocks, distributed across nodesBlock placement is handled by the “NameNode”Placement coordinated with job tracker = writes always co-located, reads co-located with computation whenever possibleBlocks replicated to handle failure, replica blocks can be used by compute tasksChecksums used to ensure data integrity

Replication: one and only strategy for error handling, recovery and fault tolerance

Self HealingMakes multiple copies (typically 3)

15

Page 16: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

HDFS on local DAS

A Hadoop cluster consisting of many nodes, each of which has local direct attached storage (DAS)

Disks are running a standard file system (e.g., ext. xfs, etc.)HDFS blocks are stored as files in a special directoryDisks attached directly, for example, with SAS or SATANo storage is shared, disks only attach to a single node

The most common use case for HadoopOriginal design point for Hadoop/HDFSCan work with cheap unreliable hardwareSome very large systems utilize this model

16

Page 17: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

HDFS on local DAS

17

Compute nodes are part of HDFS, data spread across nodes

HDFS Protocol

Page 18: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

HDFS File Write Operation

18

Image source: Hadoop, The Definitive Guide Tom White, O’Reilly

3-wayreplication

Page 19: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

HDFS File Read Operation

19

Image source: Hadoop, The Definitive Guide Tom White, O’Reilly

Page 20: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

HDFS on local DAS - Pros and Cons

ProsWrites are highly parallel

Large files are broken into many parts, distributed across the clusterThree copies of any file block, one written local, two remoteNot a simple round-robin scheme, tuned for Hadoop jobs

Job tracker attempts to make reads localIf possible, tasks scheduled in same node as the needed file segmentDuplicate file segments are also readable, can be used for tasks too

ConsNot a replacement for general purpose storage

Not a kernel-based POSIX filesystemIncompatible with standard applications and utilities (but future versions of Hadoop are adding more other application models)

High replication cost compared with RAID/shared diskThe NameNode keeps track of data location

SPOF - location data is critical and must be protectedScalability bottleneck (everything has to be in memory)Improvements to NameNode are in the works

20

Page 21: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Other HDFS storage options

HDFS on Storage Area Network (SAN) attached storageA lot like DAS,

Disks are logical volumes in storage array(s), accessed across a SANHDFS doesn’t know the difference

– Still appears like a locally attached disk

SAN attached arrays aren’t the same as DASArray has its own cache, redundancy, replication, etc.Any node on the SAN can access any array volume

So a new node can be assigned to a failed node’s data

21

Page 22: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

HDFS with SAN Storage

22

Storage Arrays Compute nodes

Hadoop Cluster

iSCSI or FC SAN

Page 23: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

HDFS File Write Operation

23

Array can provide redundancy, no need to replicate data across data nodes

Array Replication

Page 24: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

HDFS File Read Operation

24

Array redundancy, means only a single source for data

Page 25: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

SAN for Hadoop Storage

Instead of storing data on direct attached local disks, data is in one or more arrays attached to data nodes through a SAN

Looks like local storage to data nodesHadoop still utilizes HDFS

ProsAll the normal advantages of arrays

RAID, centralized caching, thin provisioning, other advanced array featuresCentralized management, easy redistribution of storage

Retains advantages of HDFS (as long as array is not over-utilized)Easy failover when compute node dies, can eliminate or reduce 3-way replication

ConsCost? It dependsUnless if multiple arrays are used, scale is limited

And with multiple arrays, management and cost advantages are reducedStill have HDFS complexity and manageability issues

25

Page 26: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Overview

IntroductionWhat is HadoopWhat is MapReduceHow does Hadoop use storageDistributed filesystem concepts

Storage optionsNative Hadoop – HDFS

On direct attached storageOn networked (SAN) storage

Alternative distributed filesystemsCloud object storageEmerging options

26

Page 27: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Other distributed filesystems

Kernel-based tightly coupled distributed file systemKernel-based, i.e., no special access libraries, looks like a normal local file systemThese filesystems have existed for years in high performance computing, scale-out NAS servers, and other scale-out computing environments

Many commercial and research examplesNot originally designed for Hadoop like HDFS

Location awareness is part of the file system – no NameNodeWorks better if functionality is “exposed” to Hadoop

Compute nodes may or may not have local storageCompute nodes are “part of the storage cluster,” but may be diskless – i.e., equal access to files and global namespace

Can tie the filesystem’s location awareness into task tracker to reduce remote storage access

Remote storage is accessed using a filesystem specific inter-node protocol

Single network hop due to filesystem’s location awareness

27

Page 28: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Tightly coupled DFS for Hadoop

General purpose shared file systemImplemented in the kernel, single namespace, compatible with most applications (no special library or language)

Data is distributed across local storage node disksArchitecturally like HDFS

Can utilize same disk options as HDFS – Including shared nothing DAS– SAN storage

Some can also support “shared SAN” storage where raw volumes can be accessed by multiple nodes

– Failover model – where only one node actively uses a volume, other can take over after failure

– Multiple initiator model – where multiple nodes actively use a volume

Shared nothing option has similar cost/performance to HDFS on DAS

28

Page 29: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Distributed FS – local disks

29

Compute nodes are part of the DFS, data spread across nodes …

Distributed FS inter-node

Protocol

Page 30: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Distributed FS – remote disks

30

Compute nodes are distributed FS clients

Scale out nodes are distributed FS servers

Distributed FS inter-node Protocol

Page 31: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Remote DFS Write Operation

31

Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS

Page 32: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Local DFS Write Operation

32

Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS

Page 33: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Remote DFS Read Operation

33

Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS

Page 34: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Local DFS Read Operation

34

Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS

Page 35: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Tightly coupled DFS for Hadoop

ProsShared data access, any node can access any data like it is localPOSIX compatible, works for non-Hadoop apps just like a local file systemCentralized management and administrationNo NameNode, may have a better block mapping mechanismCompute in-place, same copy can be served via NFS/CIFSMany of the performance benefits

ConsHDFS is highly optimized for Hadoop, unlikely to get same optimization for a general purpose DFS

Large file striping is not regular, based on compute distributionCopies are simultaneously readable

Strict POSIX compliance leads to unnecessary serializationHadoop assumes multiple-access to files, however, accesses are on block boundaries and don’t overlapNeed to relax POSIX compliance for large files, or just stick with many smaller files

Some DFS’s have scaling limitations that are worse than HDFS, not designed for “thousands” of nodes

35

Page 36: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Overview

IntroductionWhat is HadoopWhat is MapReduceHow does Hadoop use storageDistributed filesystem concepts

Storage optionsNative Hadoop – HDFS

On direct attached storageOn networked (SAN) storage

Alternative distributed filesystemsCloud object storageEmerging options

36

Page 37: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Cloud Object Storage for Hadoop

Uses a REST API like CDMI, S3, or SwiftHTTP based protocol, data is remoteObjects are write once, read many, streaming accessObjects have some stored metadata

Data is stored in cloud object storageCould be local or across internetCheap, high volumeSystems utilize triple redundancy or erasure coding, for reliability

Often uses Hadoop “S3” connector

37

Page 38: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Hadoop on Object Storage

38

Cloud Object Storage

RES

T/H

TT

P

Page 39: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Hadoop Write on Object Storage

39

Page 40: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Hadoop Read on Object Storage

40

Page 41: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Object Storage for Hadoop

ProsLow cost, high volume, reliable storageGood location for infrequently used “WORM” dataPublic cloud optionsScalable storageData can easily be shared between Hadoop and other applications

ConsAll data is remote – performance

No data/compute colocationLimited capabilities, though a good match for HadoopHigh disk cost if triple redundancy is used

Good choice for large infrequently accessed WORM items that may need to be accessed by non-Hadoop jobs as well

41

Page 42: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Emerging Options

New options are emerging in the storage research community

Caching – from enterprise storageMirror to enterprise storage, NAS/NFSSSD

Improvements to HDFSHA optionsAccess to non-Hadoop jobs

Bottom lineThe limitations of HDFS are knownWork is ongoing to improve Hadoop storage options

42

Page 43: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Summary

Hadoop provides a scalable fault-tolerant environment for analyzing unstructured and structured informationThe default way to store data for Hadoop is HDFS on local direct attached disksAlternatives to this architecture include SAN array storage, tightly-coupled general purpose DFS, and cloud object storage

They can provide some significant advantagesHowever, they aren’t without their downsides, its hard to beat a filesystem designed specifically for Hadoop

Which one is best for you?Depends on what is most important – cost, manageability, compatibility with existing infrastructure, performance, scale, …

43

Page 44: SamFineberg Big Data Hadoop Storage Options 3v9 - SNIA · PDF fileBig Data Storage Options for HadoopPRESENTATION TITLE GOES HERE ... Open source under the friendly Apache License

Big Data Storage Options for Hadoop© 2013 Storage Networking Industry Association. All Rights Reserved.

Attribution & Feedback

44

Please send any questions or comments regarding this SNIA Tutorial to [email protected]

The SNIA Education Committee thanks the following individuals for their contributions to this Tutorial.

Authorship History

Sam Fineberg, August 2012

Updates:Sam Fineberg, February 2013Sam Fineberg, March 2013

Additional ContributorsRob PeglarJoseph WhiteChris Santilli