1 © Cloudera, Inc. All rights reserved.
Cloudera’s Investments in the Spark Ecosystem
Mike Olson | Founder and Chief Strategy Officer
Our history with Spark
• On the radar since 2009 (Matei Zaharia and the RAD Lab)
• See my 2013 blog post (“MapReduce and Spark”)
• 1st vendor to ship and support Spark
• 6 contributors to Spark v1 (all other Hadoop vendors: zero)
• 2+ committers (all other Hadoop vendors: zero)
• Complemented by Intel’s substantial & early investment
• Working across the project:
  • Core, Streaming, Security, YARN (with Yahoo!), MLlib
  • Sentry, Hive, Pig, Crunch, Dataflow on Spark
  • Cloudera Manager, training, PS (6+), UG, books, etc.
• Single largest commercial distributor of Spark (per Typesafe/Databricks survey)
Our position on Spark
• Cloudera is a member of, and aligned with, the global Spark community
• Spark will replace MapReduce as the general purpose Hadoop framework
• Tremendous community – 400 developers across 50 companies
• Hadoop ecosystem integration (native & 3rd party)
• Doesn’t mean MapReduce goes away – it will be the historical framework
• Spark is not just for data science / ML
• Spark does not replace special purpose frameworks
• One size does not fit all for SQL, Search, Graph, Stream
Why Spark Matters: Logistic Regression (data fits in memory)
[Figure: running time (s), 0 to 4000, vs. number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark. Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s.]
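The per-iteration figures quoted on this slide (110 s per iteration for Hadoop; 80 s for the first Spark iteration and 1 s for each further one) can be turned into a rough cost model. The sketch below is only that arithmetic, not a benchmark: the function names are hypothetical, and it assumes the per-iteration costs stay constant as the iteration count grows.

```python
# Rough cost model built only from the numbers quoted on the slide.
# Function names and the constant-cost assumption are illustrative.

def hadoop_time(iterations, per_iteration=110):
    """Every iteration re-reads the input, so cost is linear at 110 s each."""
    return per_iteration * iterations


def spark_time(iterations, first=80, further=1):
    """The first iteration loads the data into memory; later ones reuse it."""
    if iterations <= 0:
        return 0
    return first + further * (iterations - 1)


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 30):
        print(f"{n:>2} iterations: Hadoop {hadoop_time(n)} s, Spark {spark_time(n)} s")
```

Under these assumptions, 30 iterations cost 3300 s on Hadoop and 109 s on Spark, which is why the gap widens with every extra pass over an in-memory dataset.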
In-Memory Datasets
Trends
• ½ price every 18 months
• 2x bandwidth every 3 years
The numbers get even more interesting with upcoming enhancements to the Intel architecture.
128 – 384 GB
12-24 cores
50 GB per sec
Memory: an enabler for high-performance big data applications
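To put the figures on this slide in perspective, here is a back-of-the-envelope sketch. It assumes that "50 GB per sec" is the sustained memory bandwidth of one server and that a full sequential scan can reach it, and that the "½ price every 18 months" trend applies smoothly over time. Both are simplifying assumptions, and the function names are hypothetical.

```python
# Back-of-the-envelope arithmetic from the slide's figures.
# Assumptions: 50 GB/s is sustained bandwidth; price halves smoothly.

def scan_seconds(dataset_gb, bandwidth_gb_per_s=50.0):
    """Time to stream an in-memory dataset once at the quoted bandwidth."""
    return dataset_gb / bandwidth_gb_per_s


def relative_price(months, halving_period_months=18.0):
    """Price relative to today if it halves every `halving_period_months`."""
    return 0.5 ** (months / halving_period_months)


if __name__ == "__main__":
    for size_gb in (128, 384):
        print(f"{size_gb} GB scanned in about {scan_seconds(size_gb):.2f} s")
    print(f"Relative price after 36 months: {relative_price(36):.2f}")
```

On these assumptions a full pass over 128 to 384 GB of memory takes a few seconds, and the same capacity costs a quarter as much after three years.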
Delivering Spark in Cloudera Enterprise
Hadoop integration:
• Standard Hadoop data formats
• Runs under YARN in mixed clusters
• Security
Libraries:
• MLlib – machine learning toolkit
• GraphX (alpha) – graph analytics based on PowerGraph abstractions
• Spark Streaming – near-real-time analytics
Language support:
• SparkR (upcoming)
• Java 8
• PySpark and pandas interoperability
• DataFrame API
• Schema support in Spark’s APIs
• SQL support in Spark Streaming (upcoming)
Cloudera’s Spark Investments for 2015
Community leadership
• Partner of choice for companies doing Spark integration
• Increase our involvement in the community
Batch tool of choice / replace MR
• Complete Hive on Spark
• Complete Pig on Spark
• Oozie action for Spark (Oozie team)
• Improve Spark core shuffle primitives to be equivalent to or better than MapReduce in all respects
• Integrate with Google Dataflow
• Support advanced features such as runtime DAG optimization
EDH integration and cluster citizenship
• Automatic executor launch / destruction based on usage
• Validation of Parquet / Avro with Impala-style usage
• Improved integration with HBase to simplify RDD creation against HBase tables
• ATS integration for Spark
• Container resizing with YARN support
• Tachyon alternative in HDFS (dependent on HDFS team priorities) + off-heap caching
Ease of development
• Provide EXPLAIN PLAN primitives at runtime and compile time
• Programmatic job submission interface
• Auto-compute partition model to simplify configuration space for users
Enterprise grade
• CM integration; AMon integration; tuning hints; validation
• Parallel split generation
• REST API for Spark History Server
• Security - Encryption: on-the-wire encryption, shuffle encryption
• Security - Navigator: integration with Audit, Lineage (visible through Hive as well)
• Scale - Validate Spark at very large scale and improve scalability where issues are found
• Security - MR / Spark: RecordService for deeper Sentry integration
• Security - Authorization: integrate Schema RDDs with Sentry
• Availability: Spark Streaming availability (mostly complete)
Data science tool of choice
• Hue app for Spark (a la Zeppelin, Databricks): Phase 1 REST-based interface to Spark for Hue
• Oryx2 built on Spark for data science lifecycle management
Cloudera customer use cases – core Spark
Financial Services
• Value-at-Risk calculations
• ETL pipeline speed-up
• Analyzing stock data for 20 years
Replaces: home-grown applications
Genomics
• Identify genes implicated in disease onset in full human genome
Replaces: MySQL engine
Data services
• Trend analysis using statistical methods on large data sets
• Document classification (LDA)
• Fraud analytics
Replaces: Netezza replacement; net new
ERP
• OCR and bill classification
Replaces: net new
Healthcare
• Calculating Jaccard scores on health care data sets
Replaces: net new
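For readers unfamiliar with the Jaccard score mentioned in the healthcare use case: it is the size of the intersection of two sets divided by the size of their union. The sketch below shows the calculation in plain Python. The patient records are invented for illustration, and the customer's actual Spark implementation is not described in this deck.

```python
# Jaccard score: |A ∩ B| / |A ∪ B|. Sample data is hypothetical.

def jaccard(a, b):
    """Return the Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen here: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical diagnosis codes for two patients.
    patient_a = {"asthma", "diabetes", "hypertension"}
    patient_b = {"diabetes", "hypertension", "obesity"}
    print(jaccard(patient_a, patient_b))  # 2 shared out of 4 distinct -> 0.5
```

At scale the expensive part is comparing every pair of records, which is the kind of work a distributed engine such as Spark spreads across a cluster.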
Cloudera customer use cases – Streaming
Financial Services
• On-line fraud detection
Replaces: net new
Many
• Continuous ETL
Retail
• On-line recommender systems
• Inventory management
Replaces: custom apps
Why Cloudera?
Expertise
• Deep engineering investment – only distribution vendor with engineering contributions to Spark and actual technical know-how
• Field team, support, training and services with experience in many Spark use cases
• Driving roadmap for Spark
Experience
• Most customers running Spark across all distributions put together
• Deployments range from a few nodes to 800+ nodes
• Longest field presence – first vendor to support Spark, and still only two vendors offer official support
Partnerships
• Intel partnership brings 15 Spark developers focused on Cloudera customer use cases
• Business relationship with Databricks to do joint development on Spark
Thank you [email protected] @mikeolson