October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop

Upload: yahoo-developer-network

Posted on 11-Jan-2017

TRANSCRIPT

Page 1: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


Tiering & Archive for Hadoop Today

Page 2: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


Storage Policies & Disk Types (Hadoop 2.6 and up)

Disk Type: flexible; can be assigned to any local filesystem.
Storage Policy: set on a file or inherited from its parent directory.

Hadoop HDFS Tiering Support (aka Heterogeneous Storage)

Storage Policy Name    Disk Types (n replicas)
Lazy_Persist           RAM_DISK: 1, DISK: n-1
All_SSD                SSD: n
One_SSD                SSD: 1, DISK: n-1
Hot (default)          DISK: n
Warm                   DISK: 1, ARCHIVE: n-1
Cold                   ARCHIVE: n
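
These policies can be listed, assigned, and inspected from the command line. A minimal sketch, assuming a release where the hdfs storagepolicies subcommand exists (on 2.6 itself the setter was hdfs dfsadmin -setStoragePolicy); the path is illustrative:

  # List the storage policies the cluster supports
  $ hdfs storagepolicies -listPolicies

  # Pin a directory to SSD-backed storage; children inherit the policy
  $ hdfs storagepolicies -setStoragePolicy -path /data/results -policy ONE_SSD

  # Confirm what is in effect
  $ hdfs storagepolicies -getStoragePolicy -path /data/results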

Page 3: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


Hadoop HDFS Tiering Support (aka Heterogeneous Storage)

/data/results/query2.csv

Hot Nodes

Storage Policy default is Hot; Storage Type default is DISK.

Archive Nodes

Storage Policy: Hot
Storage Type: DISK

Page 4: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


Hadoop HDFS Tiering Support (aka Heterogeneous Storage)

Hot Nodes

The Storage Policy is changed, but the file remains on the same storage type until the mover is run.

Archive Nodes

Storage Policy: Cold
Storage Type: DISK

/data/results/query2.csv
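
Retagging the example file is a single metadata operation on the NameNode, so the replicas themselves do not move yet:

  # Mark the file Cold; its blocks stay on DISK until the mover runs
  $ hdfs storagepolicies -setStoragePolicy -path /data/results/query2.csv -policy COLD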

Page 5: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


Hadoop HDFS Tiering Support (aka Heterogeneous Storage)

Storage Policy: Cold
Storage Type: ARCHIVE

Hot Nodes → Archive Nodes

After the mover is run, all replicas move to storage type ARCHIVE. Note: the file has not logically moved in HDFS; its path is unchanged.

/data/results/query2.csv
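
The mover that ships with Hadoop 2.6+ reconciles replica placement with each file's policy. A sketch, with a follow-up check via fsck (the exact output format varies by release):

  # Migrate replicas of the retagged file onto ARCHIVE storage
  $ hdfs mover -p /data/results/query2.csv

  # Block locations should now report ARCHIVE storage types
  $ hdfs fsck /data/results/query2.csv -files -blocks -locations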

Page 6: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


WHY TIER HADOOP STORAGE?

ISN’T IT ALREADY COMMODITY STORAGE? (aka the cheapest stuff on the planet)

Tiering on Hadoop – WHY?

Page 7: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


Traditional Hadoop Storage: Lower Disk Capacity to Compute

[Diagram: compute-heavy nodes with relatively little disk]

Better job scalability, performance, and consistent results

5x to 10x more expensive per GB

Page 8: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


Hadoop Archive Storage: Much Denser Disk to Compute

[Diagram: nodes with dense disk relative to compute]

Much less $ per GB

Could impact performance and produce inconsistent results

Page 9: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


Hadoop Archive Storage: Cold Goes to Archive, Hot Gets More Resources

[Diagram: compute-heavy hot nodes alongside disk-dense archive nodes]

Archive tier: much less $ per GB

Hot tier: more resources are free to process jobs.

Better Performance & Lower Infrastructure Costs

Page 10: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


So how do I utilize archive storage to lower my storage costs without impacting performance?

Answer: Intelligent Tiering

Tiering on Hadoop

Page 11: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


Pillars of Intelligent Tiering for Hadoop

HEAT / AGE / SIZE / USAGE

Heat: Access frequency of data is the most important metric for effective tiering.

Age: Age is easiest to determine. CAUTION: some data is long-term active, so age cannot be the only criterion.

Size: Zero-byte and small files should be archived differently when tiering Hadoop; large cold files should have priority for archive.

Usage: Knowing how long data remains accessed after it is ingested enables better capacity planning for your tiers.
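
One practical caveat behind the heat pillar: HDFS only records access times at the precision configured on the NameNode (one hour by default), and a precision of 0 disables atime tracking entirely, which silently breaks any access-frequency metric. A quick sanity check:

  # Access-time precision in milliseconds; 0 means atime is never recorded
  $ hdfs getconf -confKey dfs.namenode.accesstime.precision

  # Age is the easy pillar: modification times are always recorded
  $ hdfs dfs -ls -R /data/results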

Page 12: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop

FactorData Approach

Tier Hadoop HDFS by Heat, Age, Size & Usage in Three Easy Steps

01 / INSTALL WITHOUT CHANGES TO CLUSTER
Installed on a server or VM outside your existing Hadoop cluster, without inserting any proprietary technology on the cluster or in the data path.

02 / VISUALIZE & REPORT
Report data usage (heat), small files, user activity, replication, and HDFS tier utilization. Customize rules and queries to properly utilize infrastructure and plan better for future scale.

03 / AUTOMATE OPTIMIZATION
Automatically archive, promote, or change the replication factor of data based on usage patterns and user-defined rules.

Page 13: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


FactorData HDFSplus Architecture

Completely out of the data path: FactorData HDFSplus sits outside the Hadoop cluster and collects only metadata from it.

No software to install on the existing Hadoop cluster: because HDFSplus leverages only existing Hadoop APIs and features, there is nothing to install on the cluster.

Highly scalable in a small footprint: HDFS visibility and automation for thousands of Hadoop nodes from a single node, VM, or server.

[Diagram: HDFSplus communicates with the cluster's namenodes via existing Hadoop APIs]

Sizing: VM or physical machine, 32 GB RAM, 4 CPUs or vCPUs, 500 GB free disk.
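
As a rough illustration of metadata-only collection over existing APIs (not necessarily how HDFSplus itself is built), a collector outside the cluster can walk the namespace over WebHDFS; the hostname and port below are placeholders for your NameNode's HTTP address:

  # Per-file metadata (size, times, replication) for one directory
  $ curl -s "http://namenode.example.com:50070/webhdfs/v1/data/results?op=LISTSTATUS"

  # Rolled-up space consumption for capacity planning
  $ curl -s "http://namenode.example.com:50070/webhdfs/v1/data?op=GETCONTENTSUMMARY"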

Page 14: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


Simplify and Automate Archive and Tiering in Hadoop Today
• Move seldom-accessed data to storage-dense archive nodes
• Lower software licensing costs with less infrastructure
• Free resources on existing namenodes and datanodes

FactorData Tiering & Archive on Hadoop

Who or what application is creating all these small files in the cluster?

How can we move data not accessed for 90 days to archive nodes?

How can we better plan for future scale with real Hadoop storage metrics?

Result: Better Performance, Lower Hardware Costs, Lower Software Costs

Plus: Get Necessary Storage Visibility To Answer These Questions & More with FactorData HDFSplus

Page 15: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


Thank You
Visit us at: http://www.factordata.com

Page 16: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


Backup

Page 17: October 2016 HUG: The Pillars of Effective Data Archiving and Tiering in Hadoop


HDFSplus: FactorData Automates HDFS Tiering

FactorData Archive Tiering Example

How it works:
1. A query list is built based on size, heat, activity, and age
2. A storage policy is applied to matching files in HDFS based on the custom query
3. Files are optimized during the normal balancing window

Custom Query Example:
• Move all files 120 days old and not accessed for 90 days to ARCHIVE…
• FactorData creates a data list based on the query

Automated Tiering:
• Limit each automated run by max files or capacity
• FactorData tracks completion of each run
• Data can be excluded from a run according to path, size, and application
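
FactorData's automation is proprietary, but the workflow on this slide maps onto stock HDFS tooling. A hand-rolled sketch of the same loop, assuming atime tracking is enabled, GNU date, and that the Delimited fsimage dump puts path, modification time, and access time in columns 1, 3, and 4 (verify against your release); all paths and thresholds are illustrative:

  # 1. Build the candidate list from a recent fsimage copy
  $ hdfs oiv -p Delimited -i fsimage_0000000000042 -o meta.tsv
  $ awk -F'\t' -v m="$(date -d '120 days ago' '+%Y-%m-%d %H:%M')" \
        -v a="$(date -d '90 days ago' '+%Y-%m-%d %H:%M')" \
        '$3 < m && $4 < a {print $1}' meta.tsv > archive_list.txt

  # 2. Retag each candidate, then let the mover relocate the replicas
  $ while IFS= read -r f; do
      hdfs storagepolicies -setStoragePolicy -path "$f" -policy COLD
    done < archive_list.txt
  $ hdfs mover -f archive_list.txt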