cloud optimized big data

Cloud-Optimized Big-Data as a Service Joydeep Sen Sarma Co-Founder Qubole, Apache-Hive

Upload: joydeep-sen-sarma

Post on 11-Jun-2015



DESCRIPTION

What makes a big-data platform 'cloud-optimized'? Here's our (Qubole's) shot at it. @Cloud-Asia 2014.

TRANSCRIPT

Page 1: Cloud Optimized Big Data

Cloud-Optimized Big-Data as a Service

Joydeep Sen Sarma
Co-Founder Qubole, Apache-Hive

Page 2: Cloud Optimized Big Data

About Me

• @Facebook (2007-2011):
  – First Hadoop Engineer
  – Founder - Apache Hive project, PMC Member
  – Contributor to Apache Hadoop/HBase

• Founder Qubole (2012-)
  – Hadoop-as-a-Service
  – 30+ customers: Pinterest, Quora, Mediamath, Tubemogul …
  – Design/Code/Ops/Support/…

Page 3: Cloud Optimized Big Data

Big Data Cloud

• Elasticity:
  – Workloads are Bursty
  – Allows easy rolling upgrades and testing

• Lower TCO:
  – Cloud Storage is Inexpensive (2-3c/GB/month, globally replicated)
  – Zero cost to try new projects
  – Upgrade to new hardware easily (no cluster migrations!)

Page 4: Cloud Optimized Big Data

Big Data Cloud

• Global:
  – Easily set up where employees/customers/entities are located

• Collaboration:
  – Zero-Copy sharing of data with Partners and across Departments
  – Easy access to great public data sets

• As-a-Service delivery model vastly lowers Operational Cost

Page 5: Cloud Optimized Big Data

Cloud-Optimized Big Data?

• Optimized for lower TCO

• Optimized for Speed

• Optimized for Operations/Support

Page 6: Cloud Optimized Big Data

Cloud-Optimized Big Data

Optimized for lower TCO

Page 7: Cloud Optimized Big Data


select t.county, count(1) from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t group by t.county;

insert overwrite table dest select a.id, a.zip, count(distinct b.uid) from ads a join LARGE_TABLE b on (a.id=b.ad_id) group by a.id, a.zip;

hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…

AdCo Hadoop

Automated LifeCycle Mgmt
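
The first query on this slide streams each zip value through an external script. The deck does not show geo.py, so the following is only a minimal sketch of what such a script could look like, assuming Hive's default TRANSFORM protocol (tab-separated rows on stdin, one output row per line on stdout). The ZIP_TO_COUNTY table and the lookup_county, transform, and main names are hypothetical placeholders.

```python
import sys

# Hypothetical lookup table; a real script would load a full ZIP-to-county map.
ZIP_TO_COUNTY = {
    "94301": "Santa Clara",
    "94103": "San Francisco",
}


def lookup_county(zip_code):
    """Return the county for a ZIP code, or 'UNKNOWN' if it is not in the table."""
    return ZIP_TO_COUNTY.get(zip_code.strip(), "UNKNOWN")


def transform(lines):
    """Map each input row (first tab-separated field is the ZIP) to a county."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield lookup_county(fields[0])


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write one county per line to stdout."""
    for county in transform(stdin):
        stdout.write(county + "\n")
```

To use this as the script named in the query, the file would end with a call to main(); it is left out here so the functions can be imported and tried on their own.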

Page 8: Cloud Optimized Big Data

insert overwrite table dest select … from ads join campaigns on …group by …;


Auto-Scaling

[Diagram labels: StarCluster, AWS, Master, Slaves, Job Tracker, Map Tasks, Reduce Tasks, Progress, Demand, Supply]
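
The Demand and Supply labels on this slide suggest that cluster size follows the gap between outstanding tasks and available slots. A minimal sketch of such a rule follows. It is not Qubole's actual algorithm, and slots_per_node, min_nodes, and max_nodes are assumed parameters introduced only for illustration.

```python
import math


def desired_nodes(pending_tasks, running_tasks, slots_per_node,
                  min_nodes=1, max_nodes=100):
    """Return a node count whose slot supply covers current task demand."""
    demand = pending_tasks + running_tasks        # task slots wanted right now
    needed = math.ceil(demand / slots_per_node)   # nodes required to supply them
    return max(min_nodes, min(max_nodes, needed))


def scaling_action(current_nodes, target_nodes):
    """Translate the gap between supply and demand into an action."""
    if target_nodes > current_nodes:
        return ("add", target_nodes - current_nodes)
    if target_nodes < current_nodes:
        return ("remove", current_nodes - target_nodes)
    return ("hold", 0)
```

A real controller would also look at task progress before removing nodes, so that work already done on a node is not lost.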

Page 9: Cloud Optimized Big Data


Spot Instances

On average, 50-60% cheaper than regular instances

• Fallback to regular instances when Spot unavailable

• Replace regular instances with Spot when available
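
The two bullets above describe a simple policy: prefer Spot, fall back to regular instances, and swap back when Spot capacity returns. A rough sketch of that policy, assuming the caller already knows how many Spot nodes are currently obtainable; all function names and the price figures passed in are hypothetical.

```python
def plan_purchases(nodes_needed, spot_available):
    """Fill the request with Spot nodes first, then fall back to regular ones."""
    spot = min(nodes_needed, spot_available)
    regular = nodes_needed - spot
    return {"spot": spot, "regular": regular}


def plan_replacements(regular_running, spot_available):
    """Number of regular nodes that could be swapped for Spot nodes now."""
    return min(regular_running, spot_available)


def blended_cost(plan, regular_price, spot_discount):
    """Hourly cost of a plan; spot_discount is a fraction such as 0.55."""
    spot_price = regular_price * (1 - spot_discount)
    return plan["spot"] * spot_price + plan["regular"] * regular_price
```

For example, plan_purchases(10, 6) asks for 6 Spot and 4 regular nodes, and plan_replacements(4, 3) says 3 of those regular nodes can later be replaced.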

Page 10: Cloud Optimized Big Data


Using Fast but ‘Thin’ nodes

• C3 instances: 50% better performance at 20% lower cost
• Little local storage
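
As a worked example of what the C3 figures imply, the snippet below combines them into a cost per job. It assumes "50% better performance" means the same job finishes 1.5x faster and "20% lower cost" refers to the hourly price; the slide does not spell out either definition.

```python
def relative_cost_per_job(speedup, price_ratio):
    """Cost of one job relative to the baseline instance type."""
    # The job occupies the node for 1/speedup of the time, at price_ratio the price.
    return price_ratio / speedup


ratio = relative_cost_per_job(speedup=1.5, price_ratio=0.8)
print(f"{ratio:.3f}")  # prints 0.533
```

Under those assumptions a job costs a little over half of what it did on the baseline instances.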

Page 11: Cloud Optimized Big Data


Using Fast but ‘Thin’ nodes

Modify Hadoop to use Network drives for overflow

[Diagram labels: Map-Reduce, HDFS, Disk I/O, Local SSD, Overflow, Network Drives]
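
The overflow idea on this slide can be illustrated with a small directory picker: write to local SSD while it has room, and spill to a network drive otherwise. This is a sketch only, not the actual Hadoop change; pick_write_dir and its parameters are hypothetical, and the default free-space probe uses Python's shutil.disk_usage.

```python
import shutil


def pick_write_dir(local_dirs, network_dirs, bytes_needed,
                   free_bytes=lambda path: shutil.disk_usage(path).free):
    """Prefer local SSD directories; overflow to network drives when full."""
    # Local directories are listed first, so they win whenever they have space.
    for path in list(local_dirs) + list(network_dirs):
        if free_bytes(path) >= bytes_needed:
            return path
    raise OSError("no volume has enough free space")
```

Passing free_bytes explicitly makes the policy easy to try without real mount points, for example with a dictionary of made-up free-space values.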

Page 12: Cloud Optimized Big Data

Cloud-Optimized Big Data

Optimized for Speed

Page 13: Cloud Optimized Big Data

• Optimize I/O to AWS S3
  – Faster Split Computation (8x)
  – Prefetching S3 files (30%)
  – Zero-Copy writes to S3

• JVM Reuse (1.2-2x speedup)

• Columnar File Caches on local disks (1.2-2x speedup)

• 30-50% cost savings because of cluster consolidation

Faster, Faster ..
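
To illustrate the prefetching bullet, here is a minimal sketch of a reader that keeps several chunk downloads in flight while the caller consumes earlier chunks in order. It is not the implementation behind the 30% figure. fetch_chunk is a hypothetical stand-in for a ranged read from S3, and depth is an assumed tuning knob.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def prefetching_reader(fetch_chunk, num_chunks, depth=4):
    """Yield chunks in order while up to `depth` later chunks download."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        in_flight = deque()
        next_index = 0
        # Prime the pipeline with the first `depth` requests.
        while next_index < min(depth, num_chunks):
            in_flight.append(pool.submit(fetch_chunk, next_index))
            next_index += 1
        while in_flight:
            chunk = in_flight.popleft().result()  # waits only if not yet ready
            # Keep the pipeline full by requesting the next chunk, if any remain.
            if next_index < num_chunks:
                in_flight.append(pool.submit(fetch_chunk, next_index))
                next_index += 1
            yield chunk
```

The benefit comes from overlapping network latency with processing: by the time the consumer asks for chunk n+1, its download has usually already started.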

Page 14: Cloud Optimized Big Data

• 5x Faster than nearest competitor (Hive against S3)

• 30-50% cost savings because of cluster consolidation

Faster, Faster ..

Page 15: Cloud Optimized Big Data

• Presto-as-a-Service
  – 3-22x faster SQL against S3 (as tested by customer)

• 30-50% cost savings because of cluster consolidation

Faster, Faster ..

Page 16: Cloud Optimized Big Data

Cloud-Optimized Big Data

Optimized for Operations/Support

Page 17: Cloud Optimized Big Data

Rolling Upgrades

• @Facebook: we spent months upgrading a large cluster
• @Qubole: Start new cluster, Reassign label

• 30-50% cost savings because of cluster consolidation
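
The "Start new cluster, Reassign label" step works because jobs target a label rather than a specific cluster. A minimal sketch of that indirection, with a hypothetical ClusterLabels class and made-up cluster ids; the real service's API is not shown in the deck.

```python
class ClusterLabels:
    """Maps stable labels (what jobs target) to cluster ids (what runs them)."""

    def __init__(self):
        self._label_to_cluster = {}

    def assign(self, label, cluster_id):
        """Point a label at a cluster and return the previous cluster, if any."""
        previous = self._label_to_cluster.get(label)
        self._label_to_cluster[label] = cluster_id
        return previous

    def resolve(self, label):
        """Return the cluster that should run jobs submitted to this label."""
        return self._label_to_cluster[label]


labels = ClusterLabels()
labels.assign("production", "cluster-v1")        # hypothetical cluster ids
old = labels.assign("production", "cluster-v2")  # upgrade: repoint the label
```

After the reassignment, new jobs sent to "production" land on the upgraded cluster, and the old one can be shut down once its running jobs drain.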

Page 18: Cloud Optimized Big Data

Support

Chat
Email

Page 19: Cloud Optimized Big Data

Visually browse Historical Jobs

Page 20: Cloud Optimized Big Data

Visually browse Historical Jobs

Page 21: Cloud Optimized Big Data

Questions?

@jsensarma

www.qubole.com