[b4]deview 2012-hdfs

HDFS ARCHITECTURE How HDFS is evolving to meet new needs


TRANSCRIPT

Page 1: [B4]deview 2012-hdfs

HDFS ARCHITECTURE How HDFS is evolving to meet new needs

Page 2: [B4]deview 2012-hdfs

✛ Aaron T. Myers
✛ Hadoop PMC Member / Committer at ASF
✛ Software Engineer at Cloudera
✛ Primarily work on HDFS and Hadoop Security


Page 3: [B4]deview 2012-hdfs

✛ HDFS architecture circa 2010
✛ New requirements for HDFS
  > Random read patterns
  > Higher scalability
  > Higher availability
✛ HDFS evolutions to address requirements
  > Read pipeline performance improvements
  > Federated namespaces
  > Highly available Name Node


Page 4: [B4]deview 2012-hdfs

HDFS ARCHITECTURE: 2010

Page 5: [B4]deview 2012-hdfs

✛ Each cluster has…
  > A single Name Node
    ∗ Stores file system metadata
    ∗ Stores “Block ID” -> Data Node mapping
  > Many Data Nodes
    ∗ Store actual file data
  > Clients of HDFS…
    ∗ Communicate with Name Node to browse file system, get block locations for files
    ∗ Communicate directly with Data Nodes to read/write files (see the sketch below)
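From the client's point of view this all sits behind the FileSystem API. A minimal sketch of a read, assuming a placeholder Name Node URI (hdfs://namenode:8020) and file path: open() asks the Name Node for metadata and block locations, and the returned stream then fetches the block data directly from the Data Nodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Points at the single (circa-2010) Name Node; placeholder host and port
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    // open() contacts the Name Node for the file's block locations;
    // reads on the stream go straight to the Data Nodes holding the blocks
    try (FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
      BufferedReader reader = new BufferedReader(new InputStreamReader(in));
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}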


Page 6: [B4]deview 2012-hdfs


Page 7: [B4]deview 2012-hdfs

✛ Want to support larger clusters
  > ~4,000 node limit with 2010 architecture
  > New nodes beefier than old nodes
    ∗ 2009: 8 cores, 16GB RAM, 4x1TB disks
    ∗ 2012: 16 cores, 48GB RAM, 12x3TB disks
✛ Want to increase availability
  > With rise of HBase, HDFS now serving live traffic
  > Downtime means immediate user-facing impact
✛ Want to improve random read performance
  > HBase usually does small, random reads, not bulk


Page 8: [B4]deview 2012-hdfs

✛ Single Name Node
  > If Name Node goes offline, cluster is unavailable
  > Name Node must fit all FS metadata in memory
✛ Inefficiencies in read pipeline
  > Designed for large, streaming reads
  > Not small, random reads (like HBase use case)


Page 9: [B4]deview 2012-hdfs

✛ Fine for offline, batch-oriented applications
✛ If cluster goes offline, external customers don’t notice
✛ Can always use separate clusters for different groups
✛ HBase didn’t exist when Hadoop was first created
  > MapReduce was the only client application


Page 10: [B4]deview 2012-hdfs

HDFS PERFORMANCE IMPROVEMENTS

Page 11: [B4]deview 2012-hdfs

HDFS CPU Improvements: Checksumming

•  HDFS checksums every piece of data in/out
•  Significant CPU overhead
   •  Measure by putting ~1G in HDFS, cat file in a loop
   •  0.20.2: ~30-50% of CPU time is CRC32 computation!
•  Optimizations:
   •  Switch to “bulk” API: verify/compute 64KB at a time instead of 512 bytes (better instruction cache locality, amortize JNI overhead); see the sketch below
   •  Switch to CRC32C polynomial, SSE4.2, highly tuned assembly (~8 bytes per cycle with instruction-level parallelism!)
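As an illustration of the “bulk” idea only, not the actual Hadoop implementation (which lives in its DataChecksum utility and the native SSE4.2-accelerated CRC32C code), here is a plain-Java sketch: compute one CRC32C per 512-byte chunk across a whole 64KB buffer in a single pass, instead of invoking the checksum machinery once per 512-byte chunk as data is read. java.util.zip.CRC32C (JDK 9+) stands in for Hadoop's own CRC32C here.

import java.util.zip.CRC32C;

public class BulkChecksumSketch {
  static final int BYTES_PER_CHECKSUM = 512;   // HDFS default chunk size
  static final int BULK_SIZE = 64 * 1024;      // callers hand in up to 64KB at a time

  // One CRC per 512-byte chunk, computed for an entire buffer in one pass:
  // fewer calls into the checksum code and a hotter, tighter CRC loop
  static int[] bulkChecksums(byte[] data, int off, int len) {
    int numChunks = (len + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
    int[] sums = new int[numChunks];
    CRC32C crc = new CRC32C();
    for (int i = 0; i < numChunks; i++) {
      int chunkOff = off + i * BYTES_PER_CHECKSUM;
      int chunkLen = Math.min(BYTES_PER_CHECKSUM, len - i * BYTES_PER_CHECKSUM);
      crc.reset();
      crc.update(data, chunkOff, chunkLen);
      sums[i] = (int) crc.getValue();
    }
    return sums;
  }
}

In Hadoop the equivalent bulk call also crosses the JNI boundary once per 64KB rather than once per 512 bytes, which is where the amortized JNI overhead mentioned above comes from.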


Page 12: [B4]deview 2012-hdfs


[Bar chart: “Checksum improvements (lower is better)”, comparing CDH3u0 vs. the optimized build on random-read latency (1360us vs. 760us), random-read CPU usage, and sequential-read CPU usage, each plotted on a 0-100% scale]

Post-optimization: only 16% overhead vs. un-checksummed access; maintains ~800MB/sec from a single thread reading the OS cache.

Page 13: [B4]deview 2012-hdfs

HDFS Random access

•  0.20.2:
   •  Each individual read operation reconnects to the DataNode
   •  Much TCP handshake overhead, thread creation, etc.
•  2.0.0:
   •  Clients cache open sockets to each DataNode (like HTTP keepalive)
   •  Local readers can bypass the DataNode in some circumstances to read data directly (see the configuration sketch below)
   •  Rewritten BlockReader to eliminate a data copy
   •  Eliminated lock contention in the DataNode’s FSDataset class
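A hedged sketch of how a client might opt in to the local-read bypass (“short-circuit” reads). The property names follow later Apache Hadoop 2.x documentation rather than the exact release discussed here, and the Name Node URI, domain socket path, and file path are placeholders, so treat this as an outline rather than a drop-in configuration. The socket cache itself needs no client code: positioned reads issued through the same client simply reuse cached DataNode connections instead of paying a TCP handshake per call.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomReadClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Let a client co-located with the DataNode read block files directly,
    // bypassing the DataNode's data transfer protocol for local reads
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    // Domain socket shared with the DataNode (later Hadoop 2.x mechanism);
    // the path must match the DataNode's own configuration
    conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");

    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    try (FSDataInputStream in = fs.open(new Path("/hbase/data/some-hfile"))) {
      byte[] buf = new byte[4096];
      // Positioned read at an arbitrary offset: the small, random access
      // pattern HBase generates
      in.read(12345L, buf, 0, buf.length);
    }
  }
}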


Page 14: [B4]deview 2012-hdfs


Random-read micro benchmark (higher is better)

[Bar chart: read speed in MB/sec for 0.20.2, Trunk (no native), and Trunk (native), each measured at 4 threads / 1 file, 16 threads / 1 file, and 8 threads / 2 files]

TestParallelRead benchmark, modified to 100% random read proportion, on a quad-core Core i7 (a sketch of the same idea follows below).
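The numbers above come from Hadoop's TestParallelRead test. A rough, self-contained sketch of the same measurement idea, with an assumed Name Node URI, test file, thread count, and read size (none of which come from the slide), might look like this; it is not the actual TestParallelRead code.

import java.net.URI;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomReadBenchSketch {
  public static void main(String[] args) throws Exception {
    final int threads = 16;                  // e.g. the "16 threads, 1 file" case
    final int readSize = 64 * 1024;          // 64KB random reads
    final long durationMs = 30_000;

    Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    final Path path = new Path("/bench/testfile");   // pre-loaded test file
    final long fileLen = fs.getFileStatus(path).getLen();

    final AtomicLong bytesRead = new AtomicLong();
    final long deadline = System.currentTimeMillis() + durationMs;
    ExecutorService pool = Executors.newFixedThreadPool(threads);

    for (int t = 0; t < threads; t++) {
      pool.submit(() -> {
        byte[] buf = new byte[readSize];
        Random rng = new Random();
        try (FSDataInputStream in = fs.open(path)) {
          while (System.currentTimeMillis() < deadline) {
            // Positioned read at a random offset within the file
            long off = (long) (rng.nextDouble() * (fileLen - readSize));
            int n = in.read(off, buf, 0, readSize);
            if (n > 0) bytesRead.addAndGet(n);
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(durationMs + 10_000, TimeUnit.MILLISECONDS);

    System.out.printf("Aggregate speed: %.1f MB/sec%n",
        bytesRead.get() / (1024.0 * 1024.0) / (durationMs / 1000.0));
  }
}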

Page 15: [B4]deview 2012-hdfs

Random-read macro benchmark (HBase YCSB)

[Line chart: reads/sec over time, CDH4 vs. CDH3u1]

Page 16: [B4]deview 2012-hdfs

HDFS FEDERATION ARCHITECTURE

Page 17: [B4]deview 2012-hdfs

✛ Instead of one Name Node per cluster, several
  > Before: Only one Name Node, many Data Nodes
  > Now: A handful of Name Nodes, many Data Nodes
✛ Distribute file system metadata between the NNs
✛ Each Name Node operates independently
  > Potentially overlapping ranges of block IDs
  > Introduce a new concept: block pool ID
  > Each Name Node manages a single block pool

Page 18: [B4]deview 2012-hdfs

HDFS Architecture: Federation

Page 19: [B4]deview 2012-hdfs

✛ Improve scalability to 6,000+ Data Nodes
  > Bumping into single Data Node scalability now
✛ Allow for better isolation
  > Could locate HBase dirs on a dedicated Name Node
  > Could locate /user dirs on a dedicated Name Node
✛ Clients still see a unified view of the FS namespace
  > Use ViewFS, a client-side mount table configuration (see the sketch below)


Note: Federation != Increased Availability
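A minimal ViewFS sketch, assuming two federated Name Nodes; “cluster”, “nn-user”, and “nn-hbase” are placeholder names, not anything from the slides. The client-side mount table maps /user and /hbase to different namespaces while applications keep using a single file system URI.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Client-side mount table: different parts of the namespace are served
    // by different federated Name Nodes, but clients see one tree
    conf.set("fs.defaultFS", "viewfs://cluster");
    conf.set("fs.viewfs.mounttable.cluster.link./user",  "hdfs://nn-user:8020/user");
    conf.set("fs.viewfs.mounttable.cluster.link./hbase", "hdfs://nn-hbase:8020/hbase");

    FileSystem fs = FileSystem.get(conf);
    // Resolves through the mount table to whichever Name Node owns /user
    System.out.println(fs.getFileStatus(new Path("/user")).getPath());
  }
}

In practice these properties would normally live in core-site.xml rather than be set in code; setting them programmatically just keeps the sketch self-contained.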

Page 20: [B4]deview 2012-hdfs

HDFS HIGH AVAILABILITY ARCHITECTURE

Page 21: [B4]deview 2012-hdfs

Current HDFS Availability & Data Integrity

•  Simple design, storage fault tolerance
   •  Storage: Rely on OS’s file system rather than use raw disk
   •  Storage Fault Tolerance: multiple replicas, active monitoring
•  Single NameNode Master
   •  Persistent state: multiple copies + checkpoints
   •  Restart on failure


Page 22: [B4]deview 2012-hdfs

Current HDFS Availability & Data Integrity

•  How well did it work?
   •  Lost 19 out of 329 million blocks on 10 clusters with 20K nodes in 2009
      •  Seven 9’s of reliability, and the bug responsible was fixed in 0.20
   •  18-month study: 22 failures on 25 clusters, i.e. 0.58 failures per cluster per year
      •  Only 8 would have benefitted from HA failover! (0.23 failures per cluster per year)


Page 23: [B4]deview 2012-hdfs

So why build an HA NameNode?

•  Most cluster downtime in practice is planned downtime
   •  Cluster restart for a NN configuration change (e.g. new JVM configs, new HDFS configs)
   •  Cluster restart for a NN hardware upgrade/repair
   •  Cluster restart for a NN software upgrade (e.g. new Hadoop, new kernel, new JVM)
   •  Planned downtimes cause the vast majority of outages!
•  Manual failover solves all of the above!
   •  Fail over to NN2, fix NN1, fail back to NN1, zero downtime


Page 24: [B4]deview 2012-hdfs

Approach and Terminology

•  Initial goal: Active-Standby with Hot Failover
•  Terminology
   •  Active NN: actively serves read/write operations from clients
   •  Standby NN: waits, becomes active when Active dies or is unhealthy
   •  Hot failover: standby able to take over instantly


Page 25: [B4]deview 2012-hdfs

HDFS Architecture: High Availability

•  Single NN configuration; no failover
•  Active and Standby with manual failover
   •  Addresses downtime during upgrades, the main cause of unavailability
•  Active and Standby with automatic failover
   •  Addresses downtime during unplanned outages (kernel panics, bad memory, double PDU failure, etc.)
•  See HDFS-1623 for detailed use cases
•  With Federation, each namespace volume has an active-standby NameNode pair


Page 26: [B4]deview 2012-hdfs

HDFS Architecture: High Availability

•  Failover controller outside the NN
•  Parallel block reports to Active and Standby
•  NNs share namespace state via a shared edit log
   •  NAS or Journal Nodes
   •  Like RDBMS “log shipping” replication
•  Client failover
   •  Smart clients (e.g. configuration, or ZooKeeper for coordination; see the sketch below)
   •  IP failover in the future
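A sketch of the configuration-driven “smart client”, with placeholder nameservice and host names; the keys follow the Apache Hadoop 2.x HA documentation. The client addresses a logical nameservice instead of a single Name Node, and the failover proxy provider retries against the other NameNode when the active one changes.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // A logical nameservice rather than a single Name Node host
    conf.set("fs.defaultFS", "hdfs://mycluster");
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn-host1:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn-host2:8020");

    // The "smart client" piece: a proxy provider that knows both NNs and
    // retries against the other one when a failover happens
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

    // Normal FileSystem calls now transparently follow the active Name Node
    FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
    System.out.println(fs.getUri());
  }
}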


Page 27: [B4]deview 2012-hdfs

HDFS Architecture: High Availability

Page 28: [B4]deview 2012-hdfs

HDFS ARCHITECTURE: WHAT’S NEXT

Page 29: [B4]deview 2012-hdfs

✛ Increase scalability of single Data Node
  > Currently the most-noticed scalability limit
✛ Support for point-in-time snapshots
  > To better support DR, backups
✛ Completely separate block / namespace layers
  > Increase scalability even further, new use cases
✛ Fully distributed NN metadata
  > No pre-determined “special nodes” in the system

Page 30: [B4]deview 2012-hdfs