HDFS ARCHITECTURE
How HDFS is evolving to meet new needs
✛ Aaron T. Myers
✛ Hadoop PMC Member / Committer at ASF
✛ Software Engineer at Cloudera
✛ Primarily work on HDFS and Hadoop Security
✛ HDFS architecture circa 2010
✛ New requirements for HDFS
  > Random read patterns
  > Higher scalability
  > Higher availability
✛ HDFS evolutions to address requirements
  > Read pipeline performance improvements
  > Federated namespaces
  > Highly available Name Node
HDFS ARCHITECTURE: 2010
✛ Each cluster has…
  > A single Name Node
    ∗ Stores file system metadata
    ∗ Stores “Block ID” -> Data Node mapping
  > Many Data Nodes
    ∗ Store actual file data
✛ Clients of HDFS…
  ∗ Communicate with Name Node to browse file system, get block locations for files
  ∗ Communicate directly with Data Nodes to read/write files
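To make this split concrete, here is a minimal client sketch using the public Hadoop FileSystem API; the Name Node address and file path are placeholders and error handling is omitted. The open() call consults the Name Node for metadata and block locations, while the bytes themselves are streamed from the Data Nodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder NN address

        // FileSystem.get() gives a client handle that talks to the Name Node for metadata.
        FileSystem fs = FileSystem.get(conf);

        // open() asks the Name Node for the file's block locations; the returned stream
        // then reads the block contents directly from the Data Nodes.
        try (FSDataInputStream in = fs.open(new Path("/user/example/data.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```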
✛ Want to support larger clusters
  > ~4,000 node limit with 2010 architecture
  > New nodes beefier than old nodes
    ∗ 2009: 8 cores, 16GB RAM, 4x1TB disks
    ∗ 2012: 16 cores, 48GB RAM, 12x3TB disks
✛ Want to increase availability
  > With rise of HBase, HDFS now serving live traffic
  > Downtime means immediate user-facing impact
✛ Want to improve random read performance
  > HBase usually does small, random reads, not bulk
✛ Single Name Node
  > If the Name Node goes offline, the cluster is unavailable
  > Name Node must fit all FS metadata in memory
✛ Inefficiencies in read pipeline
  > Designed for large, streaming reads
  > Not small, random reads (like the HBase use case)
✛ Fine for offline, batch-oriented applications
✛ If the cluster goes offline, external customers don’t notice
✛ Can always use separate clusters for different groups
✛ HBase didn’t exist when Hadoop was first created
  > MapReduce was the only client application
HDFS PERFORMANCE IMPROVEMENTS
HDFS CPU Improvements: Checksumming
• HDFS checksums every piece of data in/out
• Significant CPU overhead
  • Measure by putting ~1G in HDFS, cat the file in a loop
  • 0.20.2: ~30-50% of CPU time is CRC32 computation!
• Optimizations:
  • Switch to “bulk” API: verify/compute 64KB at a time instead of 512 bytes (better instruction cache locality, amortize JNI overhead)
  • Switch to CRC32C polynomial, SSE4.2, highly tuned assembly (~8 bytes per cycle with instruction-level parallelism!)
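As a rough illustration only (not the actual HDFS native code path), the sketch below times CRC32C over a buffer in 512-byte chunks versus 64KB chunks using java.util.zip.CRC32C. The real bulk API still keeps one checksum per 512-byte chunk but verifies a whole 64KB window per native call; batching is what amortizes the JNI overhead and improves cache locality.

```java
import java.util.zip.CRC32C; // requires Java 9+

public class BulkChecksumSketch {
    // Compute one CRC32C per chunk of `chunkSize` bytes, folding the results
    // together so the JIT cannot discard the work.
    static long checksumInChunks(byte[] data, int chunkSize) {
        CRC32C crc = new CRC32C();
        long combined = 0;
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            combined ^= crc.getValue();
        }
        return combined;
    }

    public static void main(String[] args) {
        byte[] data = new byte[64 * 1024 * 1024]; // 64 MB stand-in for file data

        long t0 = System.nanoTime();
        long fine = checksumInChunks(data, 512);       // fine-grained: one call per 512 B
        long fineMs = (System.nanoTime() - t0) / 1_000_000;

        long t1 = System.nanoTime();
        long bulk = checksumInChunks(data, 64 * 1024); // batched: one call per 64 KB
        long bulkMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.printf("512B chunks: %d ms (%d), 64KB chunks: %d ms (%d)%n",
                fineMs, fine, bulkMs, bulk);
    }
}
```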
[Chart: Checksum improvements (lower is better) – random-read latency, random-read CPU usage, and sequential-read CPU usage, comparing CDH3u0 to the optimized build; random-read latency drops from 1360us to 760us]
Post-optimization: only 16% overhead vs. un-checksummed access
Maintain ~800MB/sec from a single thread reading the OS cache
HDFS Random access
• 0.20.2:
  • Each individual read operation reconnects to the DataNode
  • Much TCP handshake overhead, thread creation, etc.
• 2.0.0:
  • Clients cache open sockets to each DataNode (like HTTP keepalive)
  • Local readers can bypass the DataNode in some circumstances to read data directly
  • Rewritten BlockReader to eliminate a data copy
  • Eliminated lock contention in the DataNode’s FSDataset class
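The following toy sketch (not the actual DFSClient code) shows the socket-caching idea: keep a small pool of open connections per Data Node so that back-to-back random reads skip the TCP handshake. The class name and per-node cap are made up for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class SocketCacheSketch {
    private final Map<InetSocketAddress, Deque<Socket>> cache = new HashMap<>();
    private final int perNodeLimit = 4; // hypothetical cap per Data Node

    public synchronized Socket get(InetSocketAddress dataNode) throws IOException {
        Deque<Socket> free = cache.get(dataNode);
        if (free != null && !free.isEmpty()) {
            Socket s = free.pop();
            if (!s.isClosed()) {
                return s;                    // reuse an already-open connection
            }
        }
        return new Socket(dataNode.getAddress(), dataNode.getPort()); // fresh connect
    }

    public synchronized void release(InetSocketAddress dataNode, Socket s) throws IOException {
        Deque<Socket> free = cache.computeIfAbsent(dataNode, k -> new ArrayDeque<>());
        if (free.size() < perNodeLimit && !s.isClosed()) {
            free.push(s);                    // keep it warm for the next read
        } else {
            s.close();                       // over the cap: close rather than hoard sockets
        }
    }
}
```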
Random-read micro benchmark (higher is better), speed in MB/sec:

                        0.20.2   Trunk (no native)   Trunk (native)
  4 threads, 1 file       106          253                299
  16 threads, 1 file      247          488                635
  8 threads, 2 files      187          477                633

TestParallelRead benchmark, modified to 100% random read proportion. Quad-core Core i7.
Random-read macro benchmark (HBase YCSB)
[Chart: Reads/sec over time, CDH3u1 vs. CDH4]
HDFS FEDERATION ARCHITECTURE
✛ Instead of one Name Node per cluster, several
  > Before: Only one Name Node, many Data Nodes
  > Now: A handful of Name Nodes, many Data Nodes
✛ Distribute file system metadata between the NNs
✛ Each Name Node operates independently
  > Potentially overlapping ranges of block IDs
  > Introduce a new concept: block pool ID
  > Each Name Node manages a single block pool
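To make the block pool idea concrete, here is a minimal sketch of the identifier structure; the class is hypothetical, but it mirrors the spirit of HDFS's extended block ID, where a block is only unambiguous once qualified by its pool.

```java
// Minimal sketch: with several independent Name Nodes, a block is only
// unambiguous when qualified by the block pool it belongs to.
public class FederatedBlockId {
    private final String blockPoolId; // identifies which Name Node's pool owns the block
    private final long blockId;       // may overlap across pools, hence the qualifier

    public FederatedBlockId(String blockPoolId, long blockId) {
        this.blockPoolId = blockPoolId;
        this.blockId = blockId;
    }

    @Override
    public String toString() {
        // A Data Node stores blocks from several pools, so it keys its block map
        // on the (pool ID, block ID) pair rather than the block ID alone.
        return blockPoolId + ":" + blockId;
    }
}
```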
HDFS Architecture: Federation
✛ Improve scalability to 6,000+ Data Nodes
  > Bumping into single Data Node scalability now
✛ Allow for better isolation
  > Could locate HBase dirs on a dedicated Name Node
  > Could locate /user dirs on a dedicated Name Node
✛ Clients still see a unified view of the FS namespace
  > Use ViewFS – client-side mount table configuration
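The sketch below is a toy stand-in for the client-side mount table idea behind ViewFS (not Hadoop's actual ViewFs implementation): the client maps path prefixes to the Name Node that owns that part of the namespace, so applications keep seeing one file system. The mount points and host names are placeholders.

```java
import java.net.URI;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Toy client-side mount table: longest-prefix match from a path to the
// Name Node (namespace) that owns it, similar in spirit to ViewFS.
public class MountTableSketch {
    // Reverse ordering means a longer mount point is visited before its prefix.
    private final TreeMap<String, URI> mounts = new TreeMap<>(Collections.reverseOrder());

    public void addMount(String prefix, URI nameNode) {
        mounts.put(prefix, nameNode);
    }

    public URI resolve(String path) {
        for (Map.Entry<String, URI> e : mounts.entrySet()) {
            if (path.startsWith(e.getKey())) {
                return e.getValue(); // the Name Node responsible for this subtree
            }
        }
        throw new IllegalArgumentException("No mount point for " + path);
    }

    public static void main(String[] args) {
        MountTableSketch table = new MountTableSketch();
        // Hypothetical layout: HBase on one Name Node, user dirs on another.
        table.addMount("/hbase", URI.create("hdfs://nn-hbase.example.com:8020"));
        table.addMount("/user", URI.create("hdfs://nn-user.example.com:8020"));
        System.out.println(table.resolve("/user/alice/logs")); // -> hdfs://nn-user.example.com:8020
    }
}
```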
Note: Federation != Increased Availability
HDFS HIGH AVAILABILITY ARCHITECTURE
Current HDFS Availability & Data Integrity
• Simple design, storage fault tolerance
  • Storage: rely on the OS’s file system rather than raw disk
  • Storage fault tolerance: multiple replicas, active monitoring
• Single NameNode master
  • Persistent state: multiple copies + checkpoints
  • Restart on failure
Current HDFS Availability & Data Integrity
• How well did it work?
  • Lost 19 out of 329 million blocks on 10 clusters with 20K nodes in 2009
    • 7 9’s of reliability, and that bug was fixed in 0.20
  • 18-month study: 22 failures on 25 clusters – 0.58 failures per year per cluster
    • Only 8 would have benefited from HA failover!! (0.23 failures per cluster per year)
So why build an HA NameNode?
• Most cluster downtime in practice is planned downtime
  • Cluster restart for a NN configuration change (e.g. new JVM configs, new HDFS configs)
  • Cluster restart for a NN hardware upgrade/repair
  • Cluster restart for a NN software upgrade (e.g. new Hadoop, new kernel, new JVM)
  • Planned downtimes cause the vast majority of outages!
• Manual failover solves all of the above!
  • Fail over to NN2, fix NN1, fail back to NN1, zero downtime
Approach and Terminology
• Initial goal: Active-Standby with hot failover
• Terminology
  • Active NN: actively serves read/write operations from clients
  • Standby NN: waits, becomes active when the Active dies or is unhealthy
  • Hot failover: standby is able to take over instantly
HDFS Architecture: High Availability
• Single NN configuration; no failover
• Active and Standby with manual failover
  • Addresses downtime during upgrades – the main cause of unavailability
• Active and Standby with automatic failover
  • Addresses downtime during unplanned outages (kernel panics, bad memory, double PDU failure, etc.)
• See HDFS-1623 for detailed use cases
• With Federation, each namespace volume has an active-standby NameNode pair
HDFS Architecture: High Availability
• Failover controller outside the NN
• Parallel block reports to Active and Standby
• NNs share namespace state via a shared edit log
  • NAS or Journal Nodes
  • Like RDBMS “log shipping” replication
• Client failover
  • Smart clients (e.g. configuration, or ZooKeeper for coordination)
  • IP failover in the future
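As a toy illustration of what a “smart client” does during failover (not the actual DFSClient failover proxy provider), the sketch below tries the Name Node it currently believes is active and, on a connection failure, retries the same call against the other one.

```java
import java.io.IOException;
import java.util.List;

// Toy sketch of client-side failover: try the presumed-active Name Node,
// and on failure retry the same request against the standby.
public class FailoverClientSketch {
    interface NameNodeCall<T> {
        T run(String nameNodeAddress) throws IOException;
    }

    private final List<String> nameNodes; // e.g. ["nn1.example.com:8020", "nn2.example.com:8020"]
    private int activeIndex = 0;          // which NN we currently believe is active

    FailoverClientSketch(List<String> nameNodes) {
        this.nameNodes = nameNodes;
    }

    <T> T invoke(NameNodeCall<T> call) throws IOException {
        IOException lastFailure = new IOException("no Name Nodes configured");
        for (int attempt = 0; attempt < nameNodes.size(); attempt++) {
            String target = nameNodes.get(activeIndex);
            try {
                return call.run(target);
            } catch (IOException e) {
                lastFailure = e;                                    // this NN is down or standby
                activeIndex = (activeIndex + 1) % nameNodes.size(); // fail over to the other NN
            }
        }
        throw lastFailure; // no Name Node could serve the request
    }
}
```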
HDFS Architecture: High Availability
HDFS ARCHITECTURE: WHAT’S NEXT
✛ Increase scalability of a single Data Node
  > Currently the most-noticed scalability limit
✛ Support for point-in-time snapshots
  > To better support DR and backups
✛ Completely separate the block / namespace layers
  > Increase scalability even further, enable new use cases
✛ Fully distributed NN metadata
  > No pre-determined “special nodes” in the system