[rakuten techconf2014] [c-6] leveraging spark for cluster computing

Post on 03-Jul-2015

DESCRIPTION

Rakuten Technology Conference 2014 "Leveraging Spark for Cluster Computing" Robin M.E. Swezey (Rakuten)

TRANSCRIPT

Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation.

All Spark images and code are from http://spark.apache.org/

Leveraging Spark for Cluster Computing

Robin M. E. Swezey

Rakuten Institute of Technology, Tokyo

Intelligence Domain Group

robin.swezey@mail.rakuten.com

What is Spark?

In short, Spark is the future of open-source MapReduce

Why Spark?

The current Hadoop stack is heterogeneous

Spark = fully integrated analytics suite and cluster computing framework

Berkeley AMPLab + Apache Software Foundation

Apache Hadoop, Hive, Mahout, Pig and their respective logos are trademarks of The Apache Software Foundation.

Platform

On the surface, very similar to Hadoop

• Works with HDFS
• Runs on YARN, Mesos, or standalone
• MapReduce + general cluster computing

Platform

Key differences with usual stack

1. Resilient Distributed Dataset (RDD)

Central to Spark (R dataframe-ish)

[Diagram: RDD, RDD, RDD]

Platform

Key differences with usual stack

1. Resilient Distributed Dataset (RDD)

Central to Spark (R dataframe-ish)

[Diagram: RDD<String>, RDD<Tuple>, RDD<Tuple>]
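The three labels on this slide (RDD<String>, RDD<Tuple>, RDD<Tuple>) suggest a pipeline in which a dataset of text lines is transformed into tuples and then into aggregated tuples. The slide does not say which job was pictured, so the word count below is only an assumed example. It is a plain-Python model of the idea, not Spark code: `flat_map` and `reduce_by_key` are hypothetical helpers that mimic what the RDD operators of similar names compute, on ordinary lists instead of a cluster.

```python
def flat_map(func, records):
    """Apply func to every record and flatten the results (models RDD flatMap)."""
    return [out for record in records for out in func(record)]


def reduce_by_key(func, pairs):
    """Merge the values of each key with func (models RDD reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Assumed toy input; plays the role of RDD<String>.
lines = ["spark is fast", "spark is general"]

# One (word, 1) tuple per word; plays the role of the first RDD<Tuple>.
pairs = flat_map(lambda line: [(word, 1) for word in line.split()], lines)

# One (word, count) tuple per distinct word; plays the role of the second RDD<Tuple>.
counts = reduce_by_key(lambda a, b: a + b, pairs)

print(sorted(counts.items()))
# [('fast', 1), ('general', 1), ('is', 2), ('spark', 2)]
```

In Spark itself each of these steps would run per partition across the cluster; the sketch only shows the shape of the data at each stage.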

Platform

Key differences with usual stack

2. Better resource utilization

Disk is slow. Memory is fast. Several levels of persistence.

Read blocks from disk

Cache aggregates in memory
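The point of this slide, read once from slow storage and then reuse the in-memory result, can be illustrated with a small stand-in. `LazyDataset` and `slow_aggregate` below are hypothetical names invented for this sketch; they model the behaviour of a dataset that is recomputed on every use unless it has been marked for caching, and they are not part of any Spark API.

```python
class LazyDataset:
    """Toy dataset: computed on demand, optionally kept in memory afterwards."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the data (the slow path)
        self._cached = None       # in-memory copy, filled only if caching is on
        self._keep = False
        self.computations = 0     # how many times the slow path actually ran

    def cache(self):
        """Request that the result be kept in memory after the first computation."""
        self._keep = True
        return self

    def collect(self):
        """Return the data, recomputing it unless a cached copy exists."""
        if self._cached is not None:
            return self._cached
        self.computations += 1
        result = self._compute()
        if self._keep:
            self._cached = result
        return result


def slow_aggregate():
    # Assumed expensive step: stands in for reading blocks from disk and aggregating them.
    blocks = [[1, 2, 3], [4, 5], [6]]
    return [sum(block) for block in blocks]


uncached = LazyDataset(slow_aggregate)
uncached.collect()
uncached.collect()

cached = LazyDataset(slow_aggregate).cache()
cached.collect()
cached.collect()

print(uncached.computations, cached.computations)
# 2 1
```

The uncached dataset pays for the slow path on every access, while the cached one pays once. Iterative jobs that revisit the same data many times are where this difference matters most.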

Platform

Key differences with usual stack

2. Better resource utilization

More cores > more machines. Resource locality.

Each node × each core / each local block

Platform

Key differences with usual stack

3. Easier development & operations

Scala, Java, Python API

(Logistic Regression: 8 lines)
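The eight-line listing shown on this slide is not preserved in the transcript. As a rough illustration of why logistic regression fits the model so compactly, the sketch below expresses one gradient-descent step as a map over the points followed by a reduce that sums the per-point gradients. It is plain Python with `functools.reduce`, not the code from the slide; the toy data, the step size of 0.1, and the 100 iterations are all assumptions.

```python
import math
from functools import reduce

# Assumed toy data: (features, label) pairs with labels in {-1, +1}.
points = [
    ([1.0, 2.0], 1.0),
    ([2.0, 1.0], 1.0),
    ([-1.0, -2.0], -1.0),
    ([-2.0, -1.0], -1.0),
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gradient(w, point):
    """Gradient of the logistic loss log(1 + exp(-y * w.x)) for a single point."""
    x, y = point
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


w = [0.0, 0.0]
for _ in range(100):
    # "map" every point to its gradient, then "reduce" by summing the gradients
    total = reduce(add, map(lambda p: gradient(w, p), points))
    w = [wi - 0.1 * gi for wi, gi in zip(w, total)]

predictions = [1.0 if dot(w, x) > 0 else -1.0 for x, _ in points]
print(predictions)
# [1.0, 1.0, -1.0, -1.0]
```

Because the points are reread on every iteration, this is also the kind of workload that benefits from keeping the dataset in memory between passes.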

Platform

Key differences with usual stack

3. Easier Analytics

Interactive shells in Scala, Python

Easy to connect with SparkContext (e.g. IPython Notebook)

Platform

Key differences with usual stack

4. Integrated Solution

• Easy MapReduce
• DBMS-like functionality
• Streaming
• Machine learning

Applications

Easy MapReduce

Thanks to RDD distributed operators:

• map()
• reduce()
• reduceByKey()
• groupBy()
• sample()
• pipe()
• foreach()
• fold()
• histogram()
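Two of the operators listed on this slide, fold() and groupBy(), are easier to reason about once the data is pictured as a set of partitions. The sketch below is a plain-Python model under that assumption: the `fold` and `group_by` functions and the three-partition layout are hypothetical, written to mirror the documented behaviour that a fold starts from the zero value inside every partition and once more when merging the partial results.

```python
from functools import reduce


def fold(partitions, zero, op):
    """Fold each partition starting from zero, then fold the partial results from zero again."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


def group_by(partitions, key_func):
    """Collect every element under the key computed by key_func."""
    groups = {}
    for part in partitions:
        for item in part:
            groups.setdefault(key_func(item), []).append(item)
    return groups


# Assumed data, already split into three partitions (for example, one per block).
partitions = [[1, 2, 3], [4, 5], [6]]

total = fold(partitions, 0, lambda a, b: a + b)
# A non-neutral zero is counted once per partition plus once in the final merge.
skewed = fold(partitions, 1, lambda a, b: a + b)
parity = group_by(partitions, lambda n: "even" if n % 2 == 0 else "odd")

print(total, skewed, parity)
# 21 25 {'odd': [1, 3, 5], 'even': [2, 4, 6]}
```

The skewed result shows why the zero value should be neutral for the operation: with three partitions it is applied four times, so the answer depends on how the data happens to be split.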

Applications

Sped-up Analytics with DBMS-like SQL Functionality

Cf. Spark SQL

• Integrated
• Unified data access
• Hive compatible
• Standard connectivity

Applications

Streaming

Cf. Spark Streaming

Applications

Machine Learning

Cf. MLlib

• Statistics
• Classification / regression
• Collaborative filtering
• Clustering
• Dimensionality reduction
• Feature extraction

Image: http://en.wikipedia.org/wiki/Machine_learning#mediaviewer/File:Linear-svm-scatterplot.svg

Applications

Graph Processing

Flexible

Fast

• PageRank
• Connected components
• Label propagation
• SVD++
• Triangle count

More

How does it scale?

There are deployed clusters of 1,000+ nodes

More

There’s open source, and there’s highly supported open source

Spark 1.1.0 had 171 contributors!

In Conclusion

Hadoop Cluster Computing Hype Cycle

Image: http://upload.wikimedia.org/wikipedia/commons/thumb/9/94/Gartner_Hype_Cycle.svg/2000px-Gartner_Hype_Cycle.svg.png

Thank you!

http://spark.apache.org
