jan 2013 hug: cloud-friendly hadoop and hive

Qubole Inc., Proprietary

Hadoop User Group

Ashish Thusoo

Jan 16, 2013


About Me

Big Data Veteran

Ran the data infrastructure team at Facebook before starting Qubole

Co-created Hive in 2007 @ Facebook



What is Qubole?

A comprehensive cloud data platform based on Hadoop and Hive for data in the cloud

Turnkey Infrastructure

Cloud Optimized Stack

Open Data Formats

Useful for exploring data and creating batch processing applications/data pipelines



Why Qubole?

End Users (User Ops, Product Managers, etc.)

Heterogeneous Data (Structured & Unstructured)

The Intermediaries (Data Scientists and Engineers)

BOTTLENECK


Qubole Service

Cloud Data Service

Cloud Data Platform: Elastic, Robust, Fast

DataMarts

Explore Schedule SDK

EC2 / S3

Big Data Technology Stack

ODBC

Connectors

API

Logs

Events

DBs

Metrics

Cloud Sources


Cloud vs Bare Metal

Dynamic vs Fixed Provisioning

Separation between Compute and Storage

Purchasing and Budgeting



Dynamic Provisioning

Advantage: Transient Clusters

Burden: How big of a cluster do I need?

Solution: Auto-scaled Hadoop



Challenges: Auto-scaled Hadoop

http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/

Adapting to Burstiness

Current load alone is not enough; future load also needs to be predicted

Adapting State-fully

Removing HDFS nodes is risky without decommissioning



Implementation: Auto-scaled Hadoop


TaskTrackers report launch times to the JobTracker

JT computes the amount of time required to finish existing workloads

If the time is above a certain threshold then more nodes are added

At hourly boundaries the nodes are removed in case of insufficient work
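The scaling rule above can be sketched roughly as follows. This is a minimal illustration, not Qubole's implementation: the function names, the slot-based time estimate, the 5-minute removal window, and the 3600-second billing period are all assumptions made for the example.

```python
import math


def nodes_to_add(remaining_task_seconds, slots_per_node, current_nodes,
                 threshold_seconds):
    """Return how many nodes to add so pending work fits within the threshold.

    Assumption: work is divisible evenly across all task slots.
    """
    if current_nodes < 1 or slots_per_node < 1:
        raise ValueError("cluster must have at least one node and one slot")
    total_slots = current_nodes * slots_per_node
    estimated_seconds = remaining_task_seconds / total_slots
    if estimated_seconds <= threshold_seconds:
        return 0  # existing nodes finish in time; no scale-up
    needed_slots = math.ceil(remaining_task_seconds / threshold_seconds)
    needed_nodes = math.ceil(needed_slots / slots_per_node)
    return needed_nodes - current_nodes


def near_hourly_boundary(launch_time, now, window_seconds=300,
                         billing_period=3600):
    """True if the node is within window_seconds of its next billing boundary.

    Assumption: nodes are billed per full period since launch, so removing
    a node just before the boundary wastes the least paid-for time.
    """
    elapsed_in_period = (now - launch_time) % billing_period
    return billing_period - elapsed_in_period <= window_seconds
```

For example, with 7200 task-seconds pending on 2 nodes of 2 slots each and a 600-second threshold, the sketch asks for 4 more nodes. A node launched at time 0 is considered for removal at time 3400 but not at time 1800.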



Implementation: Auto-scaled Hadoop


Restrictions on Deleting Nodes:

Nodes Containing Task Outputs of Current Jobs

Fast Decommissioning Done for Data Nodes

Minimum Cluster Size Threshold

Fast Decommissioning - possible because HDFS is a cache for us
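One way to picture these restrictions is a small planning function that filters removal candidates. The sketch below is hypothetical: the `Node` fields, the `plan_removals` name, and the action labels are invented for illustration and do not correspond to any Hadoop API.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    holds_current_task_output: bool  # output of a job that is still running
    is_data_node: bool               # also runs an HDFS DataNode


def plan_removals(all_nodes, candidates, min_cluster_size):
    """Pick which candidate nodes may be removed, and how.

    Rules taken from the slide: keep nodes holding task outputs of current
    jobs, decommission data nodes before removal, and never shrink below
    the minimum cluster size.
    """
    budget = max(len(all_nodes) - min_cluster_size, 0)
    plan = []
    for node in candidates:
        if len(plan) >= budget:
            break  # removing more would violate the minimum size
        if node.holds_current_task_output:
            continue  # a running job still needs this node's output
        action = "decommission_then_remove" if node.is_data_node else "remove"
        plan.append((node.name, action))
    return plan
```

With four nodes and a minimum size of two, at most two nodes are ever planned for removal, regardless of how many are idle.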



Compute & Storage on the Cloud (EC2/S3)

On the cloud, Compute and Storage are Separate!!

Advantage: Don’t Pay for CPU for Storing Data

Burden: Separation Can Cause Slowness & Variability

Solutions:

Caching File System

Masking S3 Latency



Caching File System

http://www.qubole.com/blog/index.php/columnar-cloud-cache/

Benefits:

Masks the performance variance associated with S3 while reading data

Columnar caching on the fly enables data to be persisted in open formats while still giving the benefits of performance
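The general idea of a caching layer in front of slow remote storage can be illustrated with a read-through cache. This is a simplified sketch only: it caches whole values by key and does not attempt the columnar conversion described above. The class name and the `fetch_remote` callable are assumptions standing in for a real S3 read.

```python
class ReadThroughCache:
    """Serve repeated reads locally; go to remote storage only on a miss."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # hypothetical stand-in for an S3 read
        self._local = {}                   # stand-in for local disk or HDFS
        self.remote_reads = 0

    def read(self, key):
        if key not in self._local:
            # Cache miss: pay the remote latency once and keep a local copy.
            self.remote_reads += 1
            self._local[key] = self._fetch_remote(key)
        return self._local[key]
```

Only the first read of a key pays the remote cost; later reads of the same key see local latency.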



Masking S3 Latency

http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/

File Operations in S3 are much slower than HDFS

Problem: This leads to bad performance when data is distributed in a lot of files

Solution:

Fast Split Generation Algorithm

Pipelined File Opens



Faster Split Generation

http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/

Directory operations with merging instead of per-file metadata (up to 8x speedup)
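To see why listing helps, compare the number of metadata requests each approach makes. The sketch below uses an in-memory `FakeStore` (invented for this example) that counts requests. Merging is modeled as listing the common prefix of the input directories once and filtering the result, which is an assumption about what merging means here.

```python
import os


class FakeStore:
    """In-memory stand-in for an object store that counts metadata requests."""

    def __init__(self, objects):
        self.objects = dict(objects)  # key -> size in bytes
        self.requests = 0

    def head(self, key):
        self.requests += 1            # one request per file
        return self.objects[key]

    def list_prefix(self, prefix, page_size=1000):
        keys = sorted(k for k in self.objects if k.startswith(prefix))
        pages = []
        for start in range(0, len(keys), page_size):
            self.requests += 1        # one request per page of results
            page = keys[start:start + page_size]
            pages.append([(k, self.objects[k]) for k in page])
        return pages


def sizes_per_file(store, keys):
    """Baseline: one metadata request per file."""
    return {k: store.head(k) for k in keys}


def sizes_by_listing(store, directories):
    """List the directories' common prefix once, then keep matching keys."""
    prefix = os.path.commonprefix(directories)
    wanted = tuple(directories)
    sizes = {}
    for page in store.list_prefix(prefix):
        for key, size in page:
            if key.startswith(wanted):
                sizes[key] = size
    return sizes
```

For 2500 files spread over two directories, the per-file path issues 2500 requests while the listing path issues 3. The real speedup depends on request latency and layout; the request counts here come from the toy store only.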



Pipelined File Opens

http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/

Open S3 files before they are read (30% improvement in simple queries)
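The pipelining idea is to overlap the open of the next file with the read of the current one. A minimal sketch with a single background thread follows; `open_fn` and `read_fn` are hypothetical callables standing in for the S3 open and the record reader.

```python
from concurrent.futures import ThreadPoolExecutor


def read_all_pipelined(paths, open_fn, read_fn):
    """Read files in order while opening the next file in the background."""
    results = []
    if not paths:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(open_fn, paths[0])
        for index in range(len(paths)):
            handle = pending.result()  # wait for the open that is in flight
            if index + 1 < len(paths):
                # Start opening the next file before reading this one.
                pending = pool.submit(open_fn, paths[index + 1])
            results.append(read_fn(handle))
    return results
```

Results come back in the original file order. The open latency of every file after the first is hidden behind the preceding read, provided reads take at least as long as opens.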



Purchasing Instances

Buying Instances on Spot Prices vs On-Demand Prices

Benefits: Cheaper on average by 50-60%

Problems: Spot instances are not guaranteed and can be taken away anytime

Bad for MapReduce

Disastrous for HDFS



Spotted Hadoop Clusters

http://www.qubole.com/blog/index.php/hadoop-auto-scale-ec2-spot-instances/

Simplified Spot Bidding Strategy

Configuring Bidding Timeouts

Configuring % of instances through spot

Configuring bid prices

Spot Instance Aware HDFS Block Placement

Ensures One Replica of Each Block Resides on On-Demand Nodes
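The block placement constraint can be sketched as: pick the first replica from the on-demand nodes, then fill the remaining replicas from any other node. This is an illustrative simplification, not the HDFS block placement policy interface. The function name and the default replication factor of 3 are assumptions.

```python
import random


def place_replicas(on_demand_nodes, spot_nodes, replication=3, rng=random):
    """Choose replica locations with at least one on an on-demand node."""
    if not on_demand_nodes:
        raise ValueError("need at least one on-demand node to anchor a replica")
    # The anchor replica survives even if every spot instance is reclaimed.
    anchor = rng.choice(on_demand_nodes)
    others = [n for n in on_demand_nodes + spot_nodes if n != anchor]
    extra = min(replication - 1, len(others))
    return [anchor] + rng.sample(others, extra)
```

If all spot instances are taken away at once, each block still has its anchor replica, so no data is lost and HDFS can re-replicate from it.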



Conclusion

Cloud is Different from Bare Metal

Check out more optimizations that we have made to run Hadoop and Hive optimally in the cloud at our blog


http://www.qubole.com/blog/


Thank you.

Free Sign up for Qubole at https://api.qubole.com/users/sign_up

Careers at http://www.qubole.com/careers
