Multi-Data-Center Hadoop in a Snap Dr. Konstantin Boudnik Vice President, Open Source Development

Upload: lester-elliott

Post on 16-Dec-2015


Multi-Data-Center Hadoop in a Snap

Dr. Konstantin Boudnik
Vice President, Open Source Development

My background

● 15-year Sun Microsystems veteran: JVM, distributed systems
● Vice President, Apache Bigtop
● Committer, PMC member and contributor to various ASF projects
● Member of Apache IPMC
● Early Hadoop committer


WANdisco Background

• WANdisco: Wide Area Network Distributed Computing
  – Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability
• Leader in tools for software engineers
  – Subversion
  – Apache Software Foundation sponsor
• Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND)
• US patent for active-active replication technology granted, November 2012
• Global locations
  – San Ramon (CA)
  – Chengdu (China)
  – Tokyo (Japan)
  – Boston (MA)
  – Sheffield (UK)
  – Belfast (UK)

Customers

Non-Stop Hadoop

Non-Intrusive Plugin

Provides Continuous Availability
In the LAN / Across the WAN

Active/Active

3 Key Problems For Multi Cluster Hadoop
LAN / WAN

Enterprise Ready Hadoop
Characteristics of Mission Critical Applications

• Require 100% Uptime of Hadoop
  – SLAs, Regulatory Compliance
• Require HDFS to be Deployed Globally
  – Share Data Between Data Centers
  – Data is Consistent, Not Eventually Consistent
• Ease Administrative Burden
  – Reduce Operational Complexity
  – Simplify Disaster Recovery
  – Lower RTO/RPO
• Allow Maximum Utilization of Resources
  – Within the Data Center
  – Across Data Centers

Single Standby
• Inefficient utilization of resource
  – Journal Nodes
  – ZooKeeper Nodes
  – Standby Node
• Performance Bottleneck
• Still tied to the beeper
• Limited to LAN scope

Active / Active
• All resources utilized
  – Only NameNode configuration
  – Scale as the cluster grows
  – All NameNodes active
• Load balancing
• Set resiliency (# of active NN)
• Global Consistency

Breaking Away from Active/Passive
What’s in a NameNode

Standby Datacenter
• Idle Resource
  – Single Data Center Ingest
  – Disaster Recovery Only
• One way synchronization
  – DistCp
• Error Prone
  – Clusters can diverge over time
• Difficult to scale > 2 Data Centers
  – Complexity of sharing data increases

Active / Active
• DR Resource Available
  – Ingest at all Data Centers
  – Run Jobs in both Data Centers
• Replication is Multi-Directional
  – active/active
• Absolute Consistency
  – Single HDFS spans locations
• ‘N’ Data Center support
  – Global HDFS allows appropriate data to be shared

Breaking Away from Active/Passive
What’s in a Data Center
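The contrast on this slide between one-way synchronization and multi-directional replication can be illustrated with a toy model. The sketch below is only an illustration under stated assumptions: `Namespace`, `one_way_copy` and `OrderedLog` are hypothetical names, the "namespace" is a plain dictionary, and a single in-process ordered log stands in for whatever agreement mechanism a real active/active system uses. It is not the product's protocol and not how DistCp is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    """Toy stand-in for one data center's file namespace: path -> content."""
    files: dict = field(default_factory=dict)

    def apply(self, op):
        kind, path, data = op
        if kind == "put":
            self.files[path] = data
        elif kind == "delete":
            self.files.pop(path, None)


def one_way_copy(source, target):
    """Periodic batch copy: push everything the source has onto the target."""
    target.files.update(source.files)


class OrderedLog:
    """Hypothetical coordinator: every write, from any site, gets one global
    position and is applied at all sites in that same order."""

    def __init__(self, sites):
        self.sites = sites
        self.log = []

    def submit(self, op):
        self.log.append(op)
        for site in self.sites:
            site.apply(op)


def active_passive_demo():
    primary, standby = Namespace(), Namespace()
    primary.apply(("put", "/logs/a", "v1"))
    one_way_copy(primary, standby)
    primary.apply(("put", "/logs/b", "v1"))  # written after the last copy
    standby.apply(("put", "/logs/c", "v1"))  # local write, never flows back
    stale = primary.files != standby.files
    one_way_copy(primary, standby)           # next scheduled copy
    diverged = primary.files != standby.files
    return stale, diverged


def active_active_demo():
    dc1, dc2 = Namespace(), Namespace()
    log = OrderedLog([dc1, dc2])
    log.submit(("put", "/logs/a", "v1"))     # ingested at DC1
    log.submit(("put", "/logs/c", "v1"))     # ingested at DC2
    log.submit(("put", "/logs/a", "v2"))     # later update, same global order
    return dc1.files == dc2.files


if __name__ == "__main__":
    print("active/passive (stale, diverged):", active_passive_demo())
    print("active/active consistent:", active_active_demo())
```

In the active/passive run the standby is stale between copies and still differs after the next copy, because the write made at the standby is never sent back. In the active/active run both sites apply the same operations in the same order, so their contents match.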

One Cluster Approach

• Example Applications
  – HBase
  – RT Query
  – MapReduce
• Poor Resource Management
  – Data Locality Issues
  – Network Use
  – Complex

Multiple Clusters

Creating Multiple Clusters

• Example Applications
  – HBase
  – RT Query
  – MapReduce
• Need to share data between clusters
  – DistCp / Stale Data
  – Inefficient use of storage and/or network
  – Some clusters may not be available

Multiple Clusters

Cluster Zones
Zoning for Optimal Efficiency

1 HDFS, 100% Consistency

Multi Datacenter Hadoop
Disaster Recovery

WAN REPLICATION

• Absolute Consistency
• Maximum Resource Use
• Lower Recovery Time/Point
• Replicate Only What You Want
• Better Utilization of Power/Cooling
• Lower TCO
• LAN Speed Performance
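"Replicate Only What You Want" suggests that replication is scoped to chosen parts of the namespace. A minimal sketch of such a rule, assuming replication is selected by directory root: the function name `is_replicated` and the list-of-roots format are hypothetical and are not taken from the product's configuration.

```python
from pathlib import PurePosixPath


def is_replicated(path, replicated_roots):
    """Return True if `path` equals, or lies under, any replicated root.

    Paths are compared component by component, so "/data/shared-private"
    does not match the root "/data/shared".
    """
    p = PurePosixPath(path)
    for root in replicated_roots:
        r = PurePosixPath(root)
        if p == r or r in p.parents:
            return True
    return False


if __name__ == "__main__":
    roots = ["/data/shared", "/warehouse/global"]  # illustrative roots only
    for candidate in ("/data/shared/clicks/2015.log",
                      "/data/shared-private/x",
                      "/tmp/scratch"):
        print(candidate, is_replicated(candidate, roots))
```

Comparing path components rather than raw string prefixes avoids accidentally replicating a sibling directory whose name merely starts with the same characters.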

Architecture of a Non-Stop Hadoop

Technical Use Cases

• Eliminate Performance Bottleneck
  – HBase issues
• Multi Data-Center Ingest
  – Information doesn't need to be sent to one DC and then copied back to the other using DistCp
  – Parallel ingest methods don’t require redirected data streams
  – Ingest data at, or close to, the source
  – Global Analysis (Logs, Click Streams, etc.)
• Cluster Zones
  – Efficient use of resources based on application profile
  – HBase, MapReduce, Spark, etc.
• Maximize Data Center Resource Utilization
  – All data centers can be used to run different jobs concurrently
• Disaster Recovery
  – Data is as current as possible (no periodic syncs)
  – Virtually zero downtime to recover from regional data center failure
  – Regulatory compliance
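To see why "no periodic syncs" matters for the recovery point, the arithmetic can be worked through for a scheduled batch copy. The sketch assumes each copy captures the source as of the moment the copy starts. The function name and the numbers are illustrative only and are not measurements from the talk.

```python
def worst_case_rpo_minutes(sync_interval_min, copy_duration_min):
    """Worst-case data-loss window for a scheduled one-way copy.

    Assumption: each copy captures the source as of its start time. Data
    written just after a copy starts only becomes safe when the *next* copy
    finishes, one interval plus one copy duration later.
    """
    if sync_interval_min < 0 or copy_duration_min < 0:
        raise ValueError("durations must be non-negative")
    return sync_interval_min + copy_duration_min


if __name__ == "__main__":
    # Illustrative numbers: an hourly copy that takes 15 minutes to run.
    print(worst_case_rpo_minutes(60, 15))  # 75
```

With continuous replication there is no schedule term in this sum, so the window is bounded by replication lag rather than by the sync interval.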

Non-Stop Hadoop Demonstration

Q & A

Thank you