[Rakuten TechConf 2014] [C-6] Leveraging Spark for Cluster Computing
Post on 03-Jul-2015
Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation.
All Spark images and code are from http://spark.apache.org/
Leveraging Spark for Cluster Computing
Robin M. E. Swezey
Rakuten Institute of Technology, Tokyo
Intelligence Domain Group
robin.swezey@mail.rakuten.com
What is Spark?
In short, Spark is the future of
open-source MapReduce
Current Hadoop stack is heterogeneous
Spark = Fully integrated analytics suite and cluster computing framework
Berkeley AMP lab + Apache Software Foundation
Why Spark?
Apache Hadoop, Hive, Mahout, Pig and their respective logos are trademarks of The Apache Software Foundation.
On the surface, very similar to Hadoop
• Relies on HDFS
• Runs on Yarn, Mesos, or standalone
• MapReduce + General cluster computing
Platform
1. Resilient Distributed Dataset (RDD)
Central to Spark (R dataframe-ish)
Platform
[Figure: RDD, RDD, RDD]
Key differences with usual stack
RDD<String>, RDD<Tuple>, RDD<Tuple>
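The typed labels above can be read as a dataset of raw strings being turned into datasets of tuples. The following is a minimal plain-Python sketch of that idea, assuming the three labels depict a chain of transformations. It does not use Spark: `ToyDataset`, its methods, and the sample records are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Iterable, List


class ToyDataset:
    """Hypothetical stand-in for an RDD: an immutable, lazily evaluated collection."""

    def __init__(self, source: Callable[[], Iterable]):
        # Store a recipe for producing the elements, not the elements themselves.
        self._source = source

    def map(self, fn: Callable) -> "ToyDataset":
        # A transformation returns a new dataset and leaves the parent untouched.
        parent = self._source
        return ToyDataset(lambda: (fn(x) for x in parent()))

    def filter(self, pred: Callable) -> "ToyDataset":
        parent = self._source
        return ToyDataset(lambda: (x for x in parent() if pred(x)))

    def collect(self) -> List:
        # An action: only now is the chain of recipes actually run.
        return list(self._source())


raw_lines = ["alice,3", "bob,5", "carol,0"]  # made-up sample records

lines = ToyDataset(lambda: raw_lines)             # plays the role of RDD<String>
pairs = lines.map(lambda s: tuple(s.split(",")))  # plays the role of RDD<Tuple>
typed = pairs.map(lambda t: (t[0], int(t[1])))    # plays the role of RDD<Tuple>
active = typed.filter(lambda t: t[1] > 0)

print(active.collect())
```

Because each dataset only remembers how it was derived from its parent, any of them can be rebuilt from the original records on demand, which is one way to picture the "resilient" part of the name.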
Platform
2. Better resource utilization
Disk is slow. Memory is fast. Several levels of persistence.
Key differences with usual stack
Read blocks from disk
Cache aggregates in memory
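The "read from disk once, then serve aggregates from memory" pattern can be sketched in plain Python as follows. This is an illustration only, not Spark code: `read_blocks_from_disk`, `CachedAggregate`, and the block values are hypothetical. In PySpark the corresponding idea is expressed with `cache()` or `persist()` on an RDD.

```python
from typing import Dict, List, Optional

disk_reads = 0  # counts simulated disk accesses


def read_blocks_from_disk() -> List[int]:
    """Pretend to read data blocks from slow storage (values are made up)."""
    global disk_reads
    disk_reads += 1
    return [4, 8, 15, 16, 23, 42]


class CachedAggregate:
    """Hypothetical helper: compute an aggregate once, then serve it from memory."""

    def __init__(self) -> None:
        self._cached: Optional[Dict[str, float]] = None

    def get(self) -> Dict[str, float]:
        if self._cached is None:
            # Slow path: touch the disk and keep the result in memory.
            values = read_blocks_from_disk()
            self._cached = {"sum": float(sum(values)), "count": float(len(values))}
        return self._cached


agg = CachedAggregate()
first = agg.get()   # triggers the single disk read
second = agg.get()  # answered from memory

print(first["sum"] / first["count"], disk_reads)
```

The second call returns the same in-memory object, so repeated queries over the aggregate never go back to disk.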
Platform
Key differences with usual stack
2. Better resource utilization
More cores > more machines. Resource locality.
Each node x each core / each local block
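One way to picture "each core works on a local block" is a pool of workers that each compute a partial result over one block, with the partial results combined at the end. The sketch below uses Python's standard thread pool purely to show that shape; the block contents and `process_block` are made up, and CPython threads do not give true CPU parallelism for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Made-up "local blocks": each inner list stands for one block stored on a node.
blocks: List[List[int]] = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]


def process_block(block: List[int]) -> int:
    """Work done by one core on one local block: here, a partial sum."""
    return sum(block)


# One worker per block, standing in for "each core takes a local block".
with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
    partial_sums = list(pool.map(process_block, blocks))

# Only the small partial results are combined, not the raw blocks.
total = sum(partial_sums)
print(partial_sums, total)
```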
Platform
3. Easier development & operations
Scala, Java, Python API
(Logistic Regression)
Key differences with usual stack
8 Lines
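The slide's code listing did not survive in this transcript. As a rough illustration of why logistic regression can be so short in a map/reduce style, here is a plain-Python sketch that assumes the gradient is computed as a map over the points followed by a reduce. It is not the code from the slide and does not use Spark; the data, step size, iteration count, and helper names are all made up.

```python
import math
from functools import reduce
from typing import List, Tuple

Point = Tuple[List[float], float]  # (features, label in {-1, +1})

# Made-up, linearly separable toy data: the label is the sign of the first feature.
points: List[Point] = [
    ([2.0, 1.0], 1.0),
    ([1.5, -0.5], 1.0),
    ([-1.0, 0.5], -1.0),
    ([-2.0, -1.0], -1.0),
]


def gradient(w: List[float], p: Point) -> List[float]:
    """Per-point gradient of the logistic loss log(1 + exp(-y * w.x))."""
    x, y = p
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
    return [scale * xi for xi in x]


def add(a: List[float], b: List[float]) -> List[float]:
    """Element-wise vector sum, used as the reduce function."""
    return [ai + bi for ai, bi in zip(a, b)]


def predict(w: List[float], x: List[float]) -> float:
    return 1.0 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1.0


w = [0.0, 0.0]
for _ in range(20):
    # "map" each point to its gradient, then "reduce" by summing the vectors.
    total = reduce(add, map(lambda p: gradient(w, p), points))
    w = [wi - 0.1 * gi for wi, gi in zip(w, total)]

print(w, [predict(w, x) for x, _ in points])
```

The training loop itself is three lines; the rest is setup that a cluster framework's distributed `map` and `reduce` would replace.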
Platform
3. Easier Analytics
Interactive Shells in Scala, Python
Easy to connect with SparkContext (e.g. IPython Notebook)
Key differences with usual stack
4. Integrated Solution
Easy MapReduce
DBMS-like Functionality
Streaming
Machine Learning
Platform
Key differences with usual stack
Applications
Thanks to RDD Distributed Operators
map()
reduce()
reduceByKey()
groupBy()
sample()
pipe()
foreach()
fold()
histogram()
…
Easy MapReduce
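To make the operator list concrete, here is a word count written in plain Python with the same vocabulary. It is a sketch of the semantics only, not Spark's API: `reduce_by_key` is a hypothetical local helper that mimics what `reduceByKey()` does to a collection of key/value pairs, and the input lines are made up.

```python
from collections import defaultdict
from functools import reduce
from typing import Callable, Dict, Iterable, List, Tuple


def reduce_by_key(pairs: Iterable[Tuple[str, int]],
                  fn: Callable[[int, int], int]) -> Dict[str, int]:
    """Group values by key, then fold each group with fn (local stand-in)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(fn, values) for key, values in grouped.items()}


lines = ["spark is fast", "spark is general"]  # made-up input

# Split every line into words and flatten the result into one list.
words = [word for line in lines for word in line.split()]

# map(): turn each word into a (word, 1) pair.
pairs = list(map(lambda word: (word, 1), words))

# reduceByKey()-style step: sum the ones per word.
counts = reduce_by_key(pairs, lambda a, b: a + b)

# reduce(): collapse all counts into a single total.
total_words = reduce(lambda a, b: a + b, counts.values())

print(counts, total_words)
```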
Applications
Cf.
Integrated Unified Data Access
Hive Compatible Standard Connectivity
Sped-up Analytics with DBMS-like SQL Functionality
Applications
Cf.
Streaming
Applications
Cf.
Statistics
Classification / Regression
Collaborative Filtering
Clustering
Dimensionality Reduction
Feature Extraction
Image: http://en.wikipedia.org/wiki/Machine_learning#mediaviewer/File:Linear-svm-scatterplot.svg
Machine Learning
Applications
Flexible
Fast
PageRank
Connected components
Label propagation
SVD++
Triangle count
Graph Processing
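As an example of one algorithm from the list, connected components can be computed by repeatedly pushing the smallest known label across every edge until nothing changes. The plain-Python sketch below runs on a single machine and does not use Spark; the graph is made up, and labelling each component with its smallest vertex id is an assumed convention.

```python
from typing import Dict, Iterable, List, Tuple


def connected_components(vertices: Iterable[int],
                         edges: List[Tuple[int, int]]) -> Dict[int, int]:
    """Label every vertex with the smallest vertex id in its component."""
    label = {v: v for v in vertices}  # start: each vertex is its own component
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            low = min(label[a], label[b])
            if label[a] != low or label[b] != low:
                # Propagate the smaller label to both endpoints of the edge.
                label[a] = label[b] = low
                changed = True
    return label


# Made-up undirected graph: {1, 2, 3} and {4, 5} are connected, 6 is isolated.
vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (4, 5)]

components = connected_components(vertices, edges)
print(components)
```

Each pass only needs a vertex's own label and its neighbours' labels, which is the kind of local, per-edge message passing that distributes well across a cluster.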
More
There are deployed clusters
of 1,000+ nodes
How does it scale?
More
Spark 1.1.0
had 171 contributors!
There’s open-source, and there’s highly supported open-source
In Conclusion
Hadoop Cluster Computing Hype Cycle
Image: http://upload.wikimedia.org/wikipedia/commons/thumb/9/94/Gartner_Hype_Cycle.svg/2000px-Gartner_Hype_Cycle.svg.png
Thank you!
http://spark.apache.org