[b4]deview 2012-hdfs

HDFS ARCHITECTURE How HDFS is evolving to meet new needs


TRANSCRIPT

Page 1: [B4]deview 2012-hdfs

HDFS ARCHITECTURE How HDFS is evolving to meet new needs

Page 2: [B4]deview 2012-hdfs

✛ Aaron T. Myers
✛ Hadoop PMC Member / Committer at ASF
✛ Software Engineer at Cloudera
✛ Primarily work on HDFS and Hadoop Security


Page 3: [B4]deview 2012-hdfs

✛ HDFS architecture circa 2010
✛ New requirements for HDFS
  > Random read patterns
  > Higher scalability
  > Higher availability
✛ HDFS evolutions to address requirements
  > Read pipeline performance improvements
  > Federated namespaces
  > Highly available Name Node


Page 4: [B4]deview 2012-hdfs

HDFS ARCHITECTURE: 2010

Page 5: [B4]deview 2012-hdfs

✛ Each cluster has…
  > A single Name Node
    ∗ Stores file system metadata
    ∗ Stores “Block ID” -> Data Node mapping
  > Many Data Nodes
    ∗ Store actual file data
  > Clients of HDFS…
    ∗ Communicate with Name Node to browse file system, get block locations for files
    ∗ Communicate directly with Data Nodes to read/write files (see the sketch below)
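From the client's point of view this all sits behind the FileSystem API. A minimal sketch of a read, assuming a placeholder Name Node URI (hdfs://namenode:8020) and file path: open() asks the Name Node for metadata and block locations, and the returned stream then fetches the block data directly from the Data Nodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Points at the single (circa-2010) Name Node; placeholder host and port
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    // open() contacts the Name Node for the file's block locations;
    // reads on the stream go straight to the Data Nodes holding the blocks
    try (FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
      BufferedReader reader = new BufferedReader(new InputStreamReader(in));
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}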


Page 6: [B4]deview 2012-hdfs


Page 7: [B4]deview 2012-hdfs

✛ Want to support larger clusters
  > ~4,000 node limit with 2010 architecture
  > New nodes beefier than old nodes
    ∗ 2009: 8 cores, 16GB RAM, 4x1TB disks
    ∗ 2012: 16 cores, 48GB RAM, 12x3TB disks
✛ Want to increase availability
  > With rise of HBase, HDFS now serving live traffic
  > Downtime means immediate user-facing impact
✛ Want to improve random read performance
  > HBase usually does small, random reads, not bulk


Page 8: [B4]deview 2012-hdfs

✛ Single Name Node
  > If Name Node goes offline, cluster is unavailable
  > Name Node must fit all FS metadata in memory
✛ Inefficiencies in read pipeline
  > Designed for large, streaming reads
  > Not small, random reads (like HBase use case)


Page 9: [B4]deview 2012-hdfs

✛ Fine for offline, batch-oriented applications
✛ If cluster goes offline, external customers don’t notice
✛ Can always use separate clusters for different groups
✛ HBase didn’t exist when Hadoop was first created
  > MapReduce was the only client application


Page 10: [B4]deview 2012-hdfs

HDFS PERFORMANCE IMPROVEMENTS

Page 11: [B4]deview 2012-hdfs

HDFS CPU Improvements: Checksumming

•  HDFS checksums every piece of data in/out
•  Significant CPU overhead
   •  Measure by putting ~1G in HDFS, cat file in a loop
   •  0.20.2: ~30-50% of CPU time is CRC32 computation!
•  Optimizations:
   •  Switch to “bulk” API: verify/compute 64KB at a time instead of 512 bytes (better instruction cache locality, amortize JNI overhead); see the sketch below
   •  Switch to CRC32C polynomial, SSE4.2, highly tuned assembly (~8 bytes per cycle with instruction-level parallelism!)
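As an illustration of the “bulk” idea only, not the actual Hadoop implementation (which lives in its DataChecksum utility and the native SSE4.2-accelerated CRC32C code), here is a plain-Java sketch: compute one CRC32C per 512-byte chunk across a whole 64KB buffer in a single pass, instead of invoking the checksum machinery once per 512-byte chunk as data is read. java.util.zip.CRC32C (JDK 9+) stands in for Hadoop's own CRC32C here.

import java.util.zip.CRC32C;

public class BulkChecksumSketch {
  static final int BYTES_PER_CHECKSUM = 512;   // HDFS default chunk size
  static final int BULK_SIZE = 64 * 1024;      // callers hand in up to 64KB at a time

  // One CRC per 512-byte chunk, computed for an entire buffer in one pass:
  // fewer calls into the checksum code and a hotter, tighter CRC loop
  static int[] bulkChecksums(byte[] data, int off, int len) {
    int numChunks = (len + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
    int[] sums = new int[numChunks];
    CRC32C crc = new CRC32C();
    for (int i = 0; i < numChunks; i++) {
      int chunkOff = off + i * BYTES_PER_CHECKSUM;
      int chunkLen = Math.min(BYTES_PER_CHECKSUM, len - i * BYTES_PER_CHECKSUM);
      crc.reset();
      crc.update(data, chunkOff, chunkLen);
      sums[i] = (int) crc.getValue();
    }
    return sums;
  }
}

In Hadoop the equivalent bulk call also crosses the JNI boundary once per 64KB rather than once per 512 bytes, which is where the amortized JNI overhead mentioned above comes from.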


Page 12: [B4]deview 2012-hdfs


[Bar chart: “Checksum improvements (lower is better)”, comparing CDH3u0 vs. the optimized build on random-read latency (1360us vs. 760us), random-read CPU usage, and sequential-read CPU usage, each plotted on a 0-100% scale]

Post-optimization: only 16% overhead vs. un-checksummed access; maintains ~800MB/sec from a single thread reading the OS cache.

Page 13: [B4]deview 2012-hdfs

HDFS Random access

•  0.20.2:
   •  Each individual read operation reconnects to the DataNode
   •  Much TCP handshake overhead, thread creation, etc.
•  2.0.0:
   •  Clients cache open sockets to each DataNode (like HTTP keepalive)
   •  Local readers can bypass the DataNode in some circumstances to read data directly (see the configuration sketch below)
   •  Rewritten BlockReader to eliminate a data copy
   •  Eliminated lock contention in the DataNode’s FSDataset class
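A hedged sketch of how a client might opt in to the local-read bypass (“short-circuit” reads). The property names follow later Apache Hadoop 2.x documentation rather than the exact release discussed here, and the Name Node URI, domain socket path, and file path are placeholders, so treat this as an outline rather than a drop-in configuration. The socket cache itself needs no client code: positioned reads issued through the same client simply reuse cached DataNode connections instead of paying a TCP handshake per call.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomReadClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Let a client co-located with the DataNode read block files directly,
    // bypassing the DataNode's data transfer protocol for local reads
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    // Domain socket shared with the DataNode (later Hadoop 2.x mechanism);
    // the path must match the DataNode's own configuration
    conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");

    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    try (FSDataInputStream in = fs.open(new Path("/hbase/data/some-hfile"))) {
      byte[] buf = new byte[4096];
      // Positioned read at an arbitrary offset: the small, random access
      // pattern HBase generates
      in.read(12345L, buf, 0, buf.length);
    }
  }
}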


Page 14: [B4]deview 2012-hdfs


Random-read micro benchmark (higher is better)

[Bar chart: read speed in MB/sec for 0.20.2, Trunk (no native), and Trunk (native), each measured at 4 threads / 1 file, 16 threads / 1 file, and 8 threads / 2 files]

TestParallelRead benchmark, modified to 100% random read proportion, on a quad-core Core i7 (a sketch of the same idea follows below).
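The numbers above come from Hadoop's TestParallelRead test. A rough, self-contained sketch of the same measurement idea, with an assumed Name Node URI, test file, thread count, and read size (none of which come from the slide), might look like this; it is not the actual TestParallelRead code.

import java.net.URI;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomReadBenchSketch {
  public static void main(String[] args) throws Exception {
    final int threads = 16;                  // e.g. the "16 threads, 1 file" case
    final int readSize = 64 * 1024;          // 64KB random reads
    final long durationMs = 30_000;

    Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    final Path path = new Path("/bench/testfile");   // pre-loaded test file
    final long fileLen = fs.getFileStatus(path).getLen();

    final AtomicLong bytesRead = new AtomicLong();
    final long deadline = System.currentTimeMillis() + durationMs;
    ExecutorService pool = Executors.newFixedThreadPool(threads);

    for (int t = 0; t < threads; t++) {
      pool.submit(() -> {
        byte[] buf = new byte[readSize];
        Random rng = new Random();
        try (FSDataInputStream in = fs.open(path)) {
          while (System.currentTimeMillis() < deadline) {
            // Positioned read at a random offset within the file
            long off = (long) (rng.nextDouble() * (fileLen - readSize));
            int n = in.read(off, buf, 0, readSize);
            if (n > 0) bytesRead.addAndGet(n);
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(durationMs + 10_000, TimeUnit.MILLISECONDS);

    System.out.printf("Aggregate speed: %.1f MB/sec%n",
        bytesRead.get() / (1024.0 * 1024.0) / (durationMs / 1000.0));
  }
}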

Page 15: [B4]deview 2012-hdfs

Random-read macro benchmark (HBase YCSB)

[Line chart: reads/sec over time, CDH4 vs. CDH3u1]

Page 16: [B4]deview 2012-hdfs

HDFS FEDERATION ARCHITECTURE

Page 17: [B4]deview 2012-hdfs

✛ Instead of one Name Node per cluster, several
  > Before: Only one Name Node, many Data Nodes
  > Now: A handful of Name Nodes, many Data Nodes
✛ Distribute file system metadata between the NNs
✛ Each Name Node operates independently
  > Potentially overlapping ranges of block IDs
  > Introduce a new concept: block pool ID
  > Each Name Node manages a single block pool

Page 18: [B4]deview 2012-hdfs

HDFS Architecture: Federation

Page 19: [B4]deview 2012-hdfs

✛ Improve scalability to 6,000+ Data Nodes
  > Bumping into single Data Node scalability now
✛ Allow for better isolation
  > Could locate HBase dirs on a dedicated Name Node
  > Could locate /user dirs on a dedicated Name Node
✛ Clients still see a unified view of the FS namespace
  > Use ViewFS, a client-side mount table configuration (see the sketch below)


Note: Federation != Increased Availability
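A minimal ViewFS sketch, assuming two federated Name Nodes; “cluster”, “nn-user”, and “nn-hbase” are placeholder names, not anything from the slides. The client-side mount table maps /user and /hbase to different namespaces while applications keep using a single file system URI.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Client-side mount table: different parts of the namespace are served
    // by different federated Name Nodes, but clients see one tree
    conf.set("fs.defaultFS", "viewfs://cluster");
    conf.set("fs.viewfs.mounttable.cluster.link./user",  "hdfs://nn-user:8020/user");
    conf.set("fs.viewfs.mounttable.cluster.link./hbase", "hdfs://nn-hbase:8020/hbase");

    FileSystem fs = FileSystem.get(conf);
    // Resolves through the mount table to whichever Name Node owns /user
    System.out.println(fs.getFileStatus(new Path("/user")).getPath());
  }
}

In practice these properties would normally live in core-site.xml rather than be set in code; setting them programmatically just keeps the sketch self-contained.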

Page 20: [B4]deview 2012-hdfs

HDFS HIGH AVAILABILITY ARCHITECTURE

Page 21: [B4]deview 2012-hdfs

Current HDFS Availability & Data Integrity

•  Simple design, storage fault tolerance
   •  Storage: Rely on OS’s file system rather than use raw disk
   •  Storage Fault Tolerance: multiple replicas, active monitoring
•  Single NameNode Master
   •  Persistent state: multiple copies + checkpoints
   •  Restart on failure


Page 22: [B4]deview 2012-hdfs

Current HDFS Availability & Data Integrity

•  How well did it work?
   •  Lost 19 out of 329 million blocks on 10 clusters with 20K nodes in 2009
      •  Seven 9’s of reliability, and the bug responsible was fixed in 0.20
   •  18-month study: 22 failures on 25 clusters, i.e. 0.58 failures per cluster per year
      •  Only 8 would have benefitted from HA failover! (0.23 failures per cluster per year)


Page 23: [B4]deview 2012-hdfs

So why build an HA NameNode?

•  Most cluster downtime in practice is planned downtime
   •  Cluster restart for a NN configuration change (e.g. new JVM configs, new HDFS configs)
   •  Cluster restart for a NN hardware upgrade/repair
   •  Cluster restart for a NN software upgrade (e.g. new Hadoop, new kernel, new JVM)
   •  Planned downtimes cause the vast majority of outages!
•  Manual failover solves all of the above!
   •  Fail over to NN2, fix NN1, fail back to NN1, zero downtime


Page 24: [B4]deview 2012-hdfs

Approach and Terminology

•  Initial goal: Active-Standby with Hot Failover
•  Terminology
   •  Active NN: actively serves read/write operations from clients
   •  Standby NN: waits, becomes active when Active dies or is unhealthy
   •  Hot failover: standby able to take over instantly


Page 25: [B4]deview 2012-hdfs

HDFS Architecture: High Availability

•  Single NN configuration; no failover
•  Active and Standby with manual failover
   •  Addresses downtime during upgrades, the main cause of unavailability
•  Active and Standby with automatic failover
   •  Addresses downtime during unplanned outages (kernel panics, bad memory, double PDU failure, etc.)
•  See HDFS-1623 for detailed use cases
•  With Federation, each namespace volume has an active-standby NameNode pair


Page 26: [B4]deview 2012-hdfs

HDFS Architecture: High Availability

•  Failover controller outside the NN
•  Parallel block reports to Active and Standby
•  NNs share namespace state via a shared edit log
   •  NAS or Journal Nodes
   •  Like RDBMS “log shipping” replication
•  Client failover
   •  Smart clients (e.g. configuration, or ZooKeeper for coordination; see the sketch below)
   •  IP failover in the future
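A sketch of the configuration-driven “smart client”, with placeholder nameservice and host names; the keys follow the Apache Hadoop 2.x HA documentation. The client addresses a logical nameservice instead of a single Name Node, and the failover proxy provider retries against the other NameNode when the active one changes.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // A logical nameservice rather than a single Name Node host
    conf.set("fs.defaultFS", "hdfs://mycluster");
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn-host1:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn-host2:8020");

    // The "smart client" piece: a proxy provider that knows both NNs and
    // retries against the other one when a failover happens
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

    // Normal FileSystem calls now transparently follow the active Name Node
    FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
    System.out.println(fs.getUri());
  }
}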


Page 27: [B4]deview 2012-hdfs

HDFS Architecture: High Availability

Page 28: [B4]deview 2012-hdfs

HDFS ARCHITECTURE: WHAT’S NEXT

Page 29: [B4]deview 2012-hdfs

✛ Increase scalability of single Data Node
  > Currently the most-noticed scalability limit
✛ Support for point-in-time snapshots
  > To better support DR, backups
✛ Completely separate block / namespace layers
  > Increase scalability even further, new use cases
✛ Fully distributed NN metadata
  > No pre-determined “special nodes” in the system

Page 30: [B4]deview 2012-hdfs