Multi-Data-Center Hadoop in a Snap
Dr. Konstantin Boudnik, Vice President, Open Source Development
My background
● 15 years Sun Microsystems veteran: JVM, distributed systems
● Vice President, Apache Bigtop
● Committer, PMC & contributor to various ASF projects
● Member of Apache IPMC
● Early Hadoop committer
WANdisco Background
• WANdisco: Wide Area Network Distributed Computing
  – Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability
• Leader in tools for software engineers
  – Subversion
  – Apache Software Foundation sponsor
• Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND)
• US patented active-active replication technology granted, November 2012
• Global locations
  – San Ramon (CA)
  – Chengdu (China)
  – Tokyo (Japan)
  – Boston (MA)
  – Sheffield (UK)
  – Belfast (UK)
Non-Stop Hadoop
Non-Intrusive Plugin
Provides Continuous Availability
In the LAN / Across the WAN
Active/Active
Enterprise Ready Hadoop
Characteristics of Mission Critical Applications
• Require 100% Uptime of Hadoop
  – SLAs, Regulatory Compliance
• Require HDFS to be Deployed Globally
  – Share Data Between Data Centers
  – Data is Consistent, Not Eventually Consistent
• Ease Administrative Burden
  – Reduce Operational Complexity
  – Simplify Disaster Recovery
  – Lower RTO/RPO
• Allow Maximum Utilization of Resources
  – Within the Data Center
  – Across Data Centers
Breaking Away from Active/Passive
What’s in a NameNode
Single Standby
• Inefficient utilization of resource
  – Journal Nodes
  – ZooKeeper Nodes
  – Standby Node
• Performance Bottleneck
• Still tied to the beeper
• Limited to LAN scope
Active / Active
• All resources utilized
  – Only NameNode configuration
  – Scale as the cluster grows
  – All NameNodes active
• Load balancing
• Set resiliency (# of active NN)
• Global Consistency
Breaking Away from Active/Passive
What’s in a Data Center
Standby Datacenter
• Idle Resource
  – Single Data Center Ingest
  – Disaster Recovery Only
• One way synchronization
  – DistCp
• Error Prone
  – Clusters can diverge over time
• Difficult to scale > 2 Data Centers
  – Complexity of sharing data increases
Active / Active
• DR Resource Available
  – Ingest at all Data Centers
  – Run Jobs in both Data Centers
• Replication is Multi-Directional
  – active/active
• Absolute Consistency
  – Single HDFS spans locations
• ‘N’ Data Center support
  – Global HDFS allows appropriate data to be shared
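The contrast between one-way synchronization and multi-directional replication can be illustrated with a toy model. This is only a sketch under stated assumptions: `Namespace`, `one_way_copy` and `replicate_in_agreed_order` are hypothetical names invented for illustration, a dictionary stands in for an HDFS namespace, and the "agreed order" is simply a Python list. It does not represent DistCp internals or WANdisco's replication technology.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    """Toy stand-in for one data center's namespace (path -> content)."""
    files: dict = field(default_factory=dict)

    def apply(self, op):
        kind, path, data = op
        if kind == "write":
            self.files[path] = data
        elif kind == "delete":
            self.files.pop(path, None)


def one_way_copy(source, target):
    """Periodic one-way sync: push source entries to target, never the reverse."""
    target.files.update(source.files)


def replicate_in_agreed_order(ops, replicas):
    """Active/active sketch: every replica applies the same ops in the same order."""
    for op in ops:
        for replica in replicas:
            replica.apply(op)


# Active/passive: the primary ingests, the standby only receives copies.
primary, standby = Namespace(), Namespace()
primary.apply(("write", "/logs/a", "v1"))
one_way_copy(primary, standby)
primary.apply(("write", "/logs/a", "v2"))    # written after the last copy
standby.apply(("write", "/tmp/local", "x"))  # stray write on the standby side
diverged = primary.files != standby.files

# Active/active: writes from both sites pass through one agreed sequence.
dc1, dc2 = Namespace(), Namespace()
agreed = [
    ("write", "/logs/a", "v1"),
    ("write", "/tmp/local", "x"),
    ("write", "/logs/a", "v2"),
]
replicate_in_agreed_order(agreed, [dc1, dc2])
consistent = dc1.files == dc2.files

print("one-way copy diverged:", diverged)
print("agreed-order replicas consistent:", consistent)
```

In the toy model the standby misses the later write and carries a change the primary never sees, while the two active sites end with identical contents because they applied identical operations in identical order.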
One Cluster Approach
• Example Applications
  – HBase
  – RT Query
  – MapReduce
• Poor Resource Management
  – Data Locality Issues
  – Network Use
  – Complex
Multiple Clusters
Creating Multiple Clusters
• Example Applications
  – HBase
  – RT Query
  – MapReduce
• Need to share data between clusters
  – DistCp / Stale Data
  – Inefficient use of storage and/or network
  – Some clusters may not be available
Multiple Clusters
Multi Datacenter Hadoop
Disaster Recovery
WAN REPLICATION
Absolute Consistency
Maximum Resource Use
Lower Recovery Time/Point
Replicate Only What You Want
Better Utilization of Power/Cooling
Lower TCO
LAN Speed Performance
Technical Use Cases
• Eliminate Performance Bottleneck
  – HBase issues
• Multi Data-Center Ingest
  – Information doesn't need to be sent to one DC and then copied back to the other using DistCp
  – Parallel ingest methods don’t require redirected data streams
  – Ingest data at, or close to, the source
  – Global Analysis (Logs, Click Streams, etc.)
• Cluster Zones
  – Efficient use of resources based on application profile
  – HBase, MapReduce, Spark, etc.
• Maximize Data Center Resource Utilization
  – All data centers can be used to run different jobs concurrently
• Disaster Recovery
  – Data is as current as possible (no periodic syncs)
  – Virtually zero downtime to recover from regional data center failure
  – Regulatory compliance
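To see why avoiding periodic syncs lowers the recovery point, here is a rough sketch of the data-loss window under a scheduled copy. The function names and the figures (a copy every 60 minutes that takes 20 minutes) are assumptions chosen for illustration, not measurements from the talk. The model also assumes each copy captures the source state at the moment it starts and only counts once it has finished.

```python
def data_loss_at(failure_time, sync_interval, copy_duration):
    """Minutes of writes lost if the primary fails at failure_time.

    Assumes copies start at 0, T, 2T, ... and each captures the state
    at its start time; a copy only counts once it has fully finished.
    """
    # Index of the latest copy that had finished by failure_time.
    completed = (failure_time - copy_duration) // sync_interval
    if completed < 0:
        return failure_time  # no copy has finished yet
    return failure_time - completed * sync_interval


def worst_case_loss(sync_interval, copy_duration, horizon):
    """Largest loss over every whole-minute failure time before horizon."""
    return max(
        data_loss_at(t, sync_interval, copy_duration) for t in range(horizon)
    )


# Hypothetical schedule: copy every 60 minutes, each copy takes 20 minutes.
worst = worst_case_loss(sync_interval=60, copy_duration=20, horizon=600)
print("worst-case minutes of lost writes:", worst)
```

At whole-minute granularity the worst case approaches the sync interval plus the copy duration, which is the window that continuous replication is meant to shrink.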