Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Qubole Inc., Proprietary Hadoop User Group Ashish Thusoo Jan 16, 2013

DESCRIPTION

The cloud lowers the barrier to entry into analytics for many small and medium-sized enterprises. Hadoop and related frameworks such as Hive, Oozie, and Sqoop are becoming the tools of choice for deriving insights from data. However, these frameworks were designed for in-house datacenters, which have different tradeoffs from a cloud environment, and making them run well in the cloud presents some challenges. In this talk, we describe how we've extended Hadoop and Hive to exploit these new tradeoffs and offer them as part of the Qubole Data Service (QDS). We will also present use cases that show how QDS is making it extremely easy for an end user to use these technologies in the cloud.

Speaker: Ashish Thusoo, CEO, Qubole

TRANSCRIPT

Page 1: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Qubole Inc., Proprietary

Hadoop User Group

Ashish Thusoo

Jan 16, 2013

Page 2: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

About Me

Big Data Veteran

Ran the data infrastructure team at Facebook before starting Qubole

Co-created Hive in 2007 @ Facebook

Page 3: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

What is Qubole?

A comprehensive cloud data platform based on Hadoop and Hive for data in the cloud

Turnkey Infrastructure

Cloud Optimized Stack

Open Data Formats

Useful for exploring data and creating batch processing applications/data pipelines

Page 4: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Why Qubole?

End Users (User Ops, Product Managers, etc.)

Heterogeneous Data (Structured & Unstructured)

The Intermediaries (Data Scientists and Engineers)

BOTTLENECK

Page 5: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Qubole Service

Cloud Data Service

Cloud Data Platform

Elastic, Robust, Fast

DataMarts

Explore Schedule SDK

EC2 / S3

Big Data Technology Stack

ODBC

Connectors

API

Logs

Events

DBs

Metrics

Cloud Sources

Page 6: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Cloud vs Bare Metal

Dynamic vs Fixed Provisioning

Separation between Compute and Storage

Purchasing and Budgeting

Page 7: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Dynamic Provisioning

Advantage: Transient Clusters

Burden: How big of a cluster do I need?

Solution: Auto-scaled Hadoop

Page 8: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Challenges: Auto-scaled Hadoop

http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/

Adapting to Burstiness

Current load is not enough; also need to predict future load

Adapting State-fully

Removing HDFS nodes is risky without decommissioning

Page 9: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Implementation: Auto-scaled Hadoop

http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/

TaskTrackers report launch times to the JobTracker

JT computes amount of time required to finish existing workloads

If the time is above a certain threshold then more nodes are added

At hourly boundaries the nodes are removed in case of insufficient work
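
The decision rule on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions, not Qubole's implementation: the `ClusterState` fields, the task-seconds estimate, the function names, and the `min_nodes` floor (borrowed from the minimum cluster size restriction on the next slide) are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    nodes: int                  # current number of worker nodes
    slots_per_node: int         # task slots each node contributes (assumed uniform)
    remaining_work_secs: float  # estimated task-seconds left in existing workloads


def estimated_finish_secs(state: ClusterState) -> float:
    """Time to finish the existing work at the current cluster size."""
    return state.remaining_work_secs / (state.nodes * state.slots_per_node)


def nodes_needed(state: ClusterState, threshold_secs: float) -> int:
    """Smallest node count that finishes the work within the threshold."""
    return math.ceil(state.remaining_work_secs / (threshold_secs * state.slots_per_node))


def scaling_delta(state, threshold_secs, at_hour_boundary, min_nodes):
    """Return how many nodes to add (positive) or remove (negative)."""
    if estimated_finish_secs(state) > threshold_secs:
        # Too slow at the current size: grow immediately.
        return nodes_needed(state, threshold_secs) - state.nodes
    if at_hour_boundary:
        # Shrink only at hourly boundaries, and never below the floor.
        target = max(min_nodes, nodes_needed(state, threshold_secs))
        return min(0, target - state.nodes)
    return 0


# 7200 task-seconds on 4 nodes x 2 slots takes 900 s, above a 600 s threshold.
print(scaling_delta(ClusterState(4, 2, 7200), 600, False, 2))
```

Scaling up reacts as soon as the estimate crosses the threshold, while scaling down waits for an hourly boundary, which mirrors the asymmetry described on the slide.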

Page 10: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Implementation: Auto-scaled Hadoop

http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/

Restrictions on Deleting Nodes:

Nodes Containing Task Outputs of Current Jobs

Fast Decommissioning Done for Data Nodes

Minimum Cluster Size Threshold

Fast Decommissioning - possible because HDFS is a cache for us
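
As a rough sketch of how the three restrictions above might combine when picking nodes to remove: the node record fields and the function name below are hypothetical and chosen only for illustration.

```python
def removable_nodes(nodes, min_cluster_size):
    """Pick nodes that are safe to remove.

    `nodes` is a list of dicts with hypothetical keys:
      name                  - node identifier
      has_live_task_output  - holds task outputs of a currently running job
      decommissioned        - its HDFS data node has finished decommissioning
    """
    candidates = [
        n["name"]
        for n in nodes
        if not n["has_live_task_output"] and n["decommissioned"]
    ]
    # Never shrink below the minimum cluster size.
    max_removable = max(0, len(nodes) - min_cluster_size)
    return candidates[:max_removable]


cluster = [
    {"name": "n1", "has_live_task_output": True, "decommissioned": True},
    {"name": "n2", "has_live_task_output": False, "decommissioned": True},
    {"name": "n3", "has_live_task_output": False, "decommissioned": False},
    {"name": "n4", "has_live_task_output": False, "decommissioned": True},
]
print(removable_nodes(cluster, min_cluster_size=3))
```

Here n2 and n4 both qualify, but the size floor allows only one removal.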

Page 11: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Compute & Storage on the Cloud (EC2/S3)

On the cloud Compute and Storage are Separate!!

Advantage: Don’t Pay for CPU for Storing Data

Burden: Separation Can Cause Slowness & Variability

Solutions:

Caching File System

Masking S3 Latency

Page 12: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Caching File System

http://www.qubole.com/blog/index.php/columnar-cloud-cache/

Page 13: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Caching File System

http://www.qubole.com/blog/index.php/columnar-cloud-cache/

Benefits:

Masks the performance variance associated with S3 while reading data

Columnar caching on the fly enables data to be persisted in open formats while still giving the benefits of performance
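
The general idea of a read-through columnar cache can be illustrated with a small in-memory sketch. It is not the Qubole caching file system: the class name, the `fetch_column` callback (standing in for reading a file from S3 and extracting one column), and the dictionary-backed cache are all assumptions made for the example.

```python
class ColumnarReadThroughCache:
    """Caches individual columns the first time a query touches them."""

    def __init__(self, fetch_column):
        # fetch_column(path, column) -> list of values; stands in for a slow
        # remote read plus parsing of the open-format source file.
        self._fetch_column = fetch_column
        self._cache = {}
        self.remote_reads = 0

    def read(self, path, columns):
        result = {}
        for col in columns:
            key = (path, col)
            if key not in self._cache:
                # Cache miss: go to the remote store once, keep the column locally.
                self._cache[key] = self._fetch_column(path, col)
                self.remote_reads += 1
            result[col] = self._cache[key]
        return result


# A fake remote table: path -> column name -> values.
REMOTE = {"s3://bucket/t1": {"id": [1, 2], "name": ["a", "b"], "ts": [10, 20]}}

cache = ColumnarReadThroughCache(lambda path, col: REMOTE[path][col])
cache.read("s3://bucket/t1", ["id", "name"])
cache.read("s3://bucket/t1", ["id", "name"])  # served from the cache
print(cache.remote_reads)
```

The source data stays in its original format; only the cached copy is organized by column, so repeated queries avoid both the remote latency and reading unused columns.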

Page 14: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Masking S3 Latency

http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/

File Operations in S3 are much slower than HDFS

Problem: This leads to bad performance when data is distributed in a lot of files

Solution:

Fast Split Generation Algorithm

Pipelined File Opens

Page 15: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Faster Split Generation

http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/

Directory operations with merging instead of per-file metadata (up to 8x speedup)
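
To see why directory operations help, the sketch below counts round trips against a fake object store. `FakeStore`, its `head` and `list_prefix` methods, and both helper functions are hypothetical stand-ins, not a real S3 client; the 8x figure on the slide is Qubole's measurement and is not reproduced here.

```python
class FakeStore:
    """In-memory stand-in for an object store that counts requests."""

    def __init__(self, objects):
        self.objects = objects  # key -> size in bytes
        self.calls = 0

    def head(self, key):
        self.calls += 1  # one request per file
        return self.objects[key]

    def list_prefix(self, prefix):
        self.calls += 1  # one request per directory
        return sorted((k, s) for k, s in self.objects.items() if k.startswith(prefix))


def sizes_per_file(store, keys):
    """Baseline: fetch metadata with one request per file."""
    return {k: store.head(k) for k in keys}


def sizes_by_listing(store, keys):
    """List each parent directory once and merge against the wanted keys."""
    wanted = set(keys)
    prefixes = sorted({k.rsplit("/", 1)[0] + "/" for k in keys})
    sizes = {}
    for prefix in prefixes:
        for key, size in store.list_prefix(prefix):
            if key in wanted:
                sizes[key] = size
    return sizes


objects = {f"logs/d{d}/part-{i}": 100 * i for d in (1, 2) for i in range(1, 4)}
keys = sorted(objects)

slow = FakeStore(objects)
fast = FakeStore(objects)
same = sizes_per_file(slow, keys) == sizes_by_listing(fast, keys)
print(slow.calls, fast.calls, same)
```

With six files spread over two directories, the per-file approach issues six requests and the listing approach issues two, while both return the same sizes for split computation.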

Page 16: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Pipelined File Opens

http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/

Open S3 files before they are read (30% improvement in simple queries)
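
One way to picture pipelined opens is a small look-ahead queue: while the current file is being read, the next few are already being opened in the background. The function below is a minimal sketch under that assumption; `open_fn`, `read_fn`, and the `lookahead` parameter are hypothetical, and it says nothing about how Qubole's Hadoop changes are actually structured.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def read_pipelined(paths, open_fn, read_fn, lookahead=2):
    """Read files in order while opening up to `lookahead` files ahead."""
    paths = iter(paths)
    pending = deque()
    results = []
    with ThreadPoolExecutor(max_workers=lookahead) as pool:
        # Prime the pipeline with the first few opens.
        for _ in range(lookahead):
            try:
                pending.append(pool.submit(open_fn, next(paths)))
            except StopIteration:
                break
        while pending:
            handle = pending.popleft().result()  # usually already open
            try:
                # Start opening another file before reading this one.
                pending.append(pool.submit(open_fn, next(paths)))
            except StopIteration:
                pass
            results.append(read_fn(handle))
    return results


# Toy stand-ins: "opening" tags the path, "reading" strips the tag again.
out = read_pipelined(["a", "bb", "ccc"], lambda p: "h:" + p, lambda h: h[2:])
print(out)
```

Results come back in the original file order because handles are consumed from the front of the queue, so the open latency is hidden without reordering the input.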

Page 17: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Purchasing Instances

Buying Instances on Spot Prices vs On-Demand Prices

Benefits: Cheaper on average by 50-60%

Problems: Spot instances are not guaranteed and can be taken away anytime

Bad for MapReduce

Disastrous for HDFS

Page 18: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Spotted Hadoop Clusters

http://www.qubole.com/blog/index.php/hadoop-auto-scale-ec2-spot-instances/

Simplified Spot Bidding Strategy

Configuring Bidding Timeouts

Configuring % of instances through spot

Configuring bid prices

Spot Instance Aware HDFS Block Placement

Ensures One Replica of Each Block Resides on On-Demand Nodes
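
The placement constraint can be illustrated with a short sketch: pick one replica from the on-demand nodes first, then fill the remaining replicas from any other node. The function name, its parameters, and the random selection are assumptions for illustration only; real HDFS block placement also weighs racks and node load, which this ignores.

```python
import random


def place_replicas(on_demand, spot, replication=3, rng=random):
    """Choose target nodes so at least one replica lands on an on-demand node."""
    if not on_demand:
        raise ValueError("need at least one on-demand node")
    # The first replica is pinned to an on-demand node, so losing every
    # spot instance at once still leaves a copy of the block.
    first = rng.choice(on_demand)
    others = [n for n in on_demand + spot if n != first]
    extra = min(replication - 1, len(others))
    return [first] + rng.sample(others, extra)


targets = place_replicas(["od1", "od2"], ["s1", "s2", "s3"], rng=random.Random(0))
print(targets)
```

Whatever the random choices, the returned list always contains at least one on-demand node and no node twice.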

Page 19: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Conclusion

Cloud is Different from Bare Metal

Check out more optimizations that we have made to run Hadoop and Hive optimally in the cloud at our blog

http://www.qubole.com/blog/

Page 20: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive

Thank you.

Free Sign up for Qubole at https://api.qubole.com/users/sign_up

Careers at http://www.qubole.com/careers