Spark with Couchbase to Electrify Your Data Processing: Couchbase Connect 2015
SPARK WITH COUCHBASE TO ELECTRIFY YOUR DATA PROCESSING
Michael Nitschinger, Couchbase
What is Spark?
©2015 Couchbase Inc. 3
Introduction
Apache Spark is a fast and general engine for large-scale data processing.
More Facts
Over 450 contributors, very active Apache Big Data project.
Huge public interest:
Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q
Community
Ecosystem growing fast: Hadoop, RDBMS, NoSQL
Package Repository (http://spark-packages.org/): Connectors, Utility Libraries
Components: Spark Core
Resilient Distributed Datasets
Clustering
Execution
Components: Spark SQL
Structured Data Frames
Distributed querying with SQL
Components: Spark Streaming
Fault-tolerant streaming applications
Components: Spark MLlib
Built-In Machine Learning Algorithms
Components: Spark GraphX
Graph processing and graph-parallel computations
How does it work?
Resilient Distributed Datasets paper:
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
rdd1.join(rdd2).groupBy(…).filter(…)
RDD Objects: build DAG
  -> DAG ->
DAGScheduler: split graph into stages of tasks; submit each stage as ready (agnostic to operators!)
  -> TaskSet ->
TaskScheduler: launch tasks via cluster manager; retry failed or straggling tasks (doesn’t know about stages)
  -> Task ->
Worker (Threads, Block manager): execute tasks; store and serve blocks
Feedback: stage failed (TaskScheduler back to DAGScheduler)
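A minimal sketch in plain Python (not Spark's actual scheduler code) of how a lazily built chain such as rdd1.join(rdd2).groupBy(…).filter(…) might be cut into stages. ToyRDD, its wide flag, and split_into_stages are hypothetical names invented for illustration. The sketch assumes that join and groupBy each need a shuffle and therefore end a stage, while filter is a narrow operation that is pipelined into the stage of its parent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyRDD:
    """Hypothetical stand-in for an RDD: it records an operation, it computes nothing."""
    op: str
    parents: List["ToyRDD"] = field(default_factory=list)
    wide: bool = False  # assumption: True if the operation needs a shuffle

    def join(self, other: "ToyRDD") -> "ToyRDD":
        return ToyRDD("join", [self, other], wide=True)

    def group_by(self) -> "ToyRDD":
        return ToyRDD("groupBy", [self], wide=True)

    def filter(self) -> "ToyRDD":
        return ToyRDD("filter", [self])  # narrow: stays in the parent's stage


def _stage_ops(rdd: ToyRDD, stages: List[List[str]]) -> List[str]:
    """Collect the operations of the stage that produces `rdd`.

    Upstream stages that end at a shuffle boundary are appended to `stages`.
    """
    ops: List[str] = []
    for parent in rdd.parents:
        parent_ops = _stage_ops(parent, stages)
        if rdd.wide:
            stages.append(parent_ops)  # shuffle boundary: the parent stage is complete
        else:
            ops.extend(parent_ops)     # narrow dependency: pipeline into this stage
    ops.append(rdd.op)
    return ops


def split_into_stages(final: ToyRDD) -> List[List[str]]:
    """Return stages ordered so that each stage follows the stages it depends on."""
    stages: List[List[str]] = []
    last_stage = _stage_ops(final, stages)
    stages.append(last_stage)
    return stages


rdd1, rdd2 = ToyRDD("rdd1"), ToyRDD("rdd2")
result = rdd1.join(rdd2).group_by().filter()  # only the graph exists so far
stages = split_into_stages(result)
print(stages)
```

Under these assumptions the chain yields four stages: one per input, one for the join, and a final one in which groupBy and filter run together. A real scheduler would then submit each stage once the stages it depends on have finished.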
Why should you care?
Spark Benefits
Linearly scalable to 1000+ worker nodes
Simpler to use than Hadoop MR
Only partial recompute on failure
For developers and data scientists: machine learning, R integration
Tight but not mandatory Hadoop integration: Sources, Sinks, Scheduler
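The "only partial recompute on failure" point can be illustrated with a toy model in plain Python. This is a sketch of the idea, not Spark code: LineagePartitions and its methods are hypothetical names, and the lineage is reduced to a single map step over a source that is assumed to stay available (for example, on durable storage). When a cached partition is lost, only that partition is rebuilt from its lineage.

```python
from typing import Callable, Dict, List


class LineagePartitions:
    """Hypothetical toy: every partition remembers how to rebuild itself."""

    def __init__(self, source: List[List[int]], transform: Callable[[int], int]):
        self.source = source              # assumed to survive worker failures
        self.transform = transform        # the recorded lineage (one map step)
        self.cache: Dict[int, List[int]] = {}
        self.computed: List[int] = []     # partition ids, in the order they were computed

    def _compute(self, pid: int) -> List[int]:
        self.computed.append(pid)
        self.cache[pid] = [self.transform(x) for x in self.source[pid]]
        return self.cache[pid]

    def collect(self) -> List[int]:
        """Return all elements, computing only partitions missing from the cache."""
        out: List[int] = []
        for pid in range(len(self.source)):
            part = self.cache[pid] if pid in self.cache else self._compute(pid)
            out.extend(part)
        return out

    def lose(self, pid: int) -> None:
        """Simulate a worker failure that drops one cached partition."""
        self.cache.pop(pid, None)


data = LineagePartitions([[1, 2], [3, 4], [5, 6]], lambda x: x * 10)
first = data.collect()    # computes partitions 0, 1 and 2
data.lose(1)              # partition 1 is gone
second = data.collect()   # rebuilds partition 1 only
print(data.computed)
```

The second collect returns the same elements as the first, and the computed log shows that partitions 0 and 2 were not touched again.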
Spark vs Hadoop
Spark is mainly RAM bound, while Hadoop is mainly HDFS (disk) bound
Fully compatible with Hadoop Input/Output
Easier to develop against thanks to functional composition
Hadoop certainly more mature, but Spark ecosystem growing fast
Ecosystem Flexibility
RDBMS
Streams
Web APIs
DCP, KV, N1QL, Views
Batching
Data Archive
OLTP Data
Infrastructure Consolidation
The Couchbase Spark Connector
Couchbase Connector
Spark Core: Automatic Cluster and Resource Management; Creating and Persisting RDDs; Java APIs in addition to Scala (planned before GA)
Spark SQL: Easy JSON handling and querying; Tight N1QL Integration (partially in dp2, fully planned before GA)
Spark Streaming: Persisting DStreams; DCP source (partially in dp2, fully planned before GA)
Facts
Current Version: 1.0.0-dp2
Beta in July, GA in Q3 (tentative)
Code: https://github.com/couchbaselabs/couchbase-spark-connector
Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki
Connection Management
Creating RDDs
Persisting RDDs
Spark SQL Integration
Spark Streaming with DCP
Questions?
Thank you.