spark with couchbase to electrify your data processing: couchbase connect 2015

SPARK WITH COUCHBASE TO ELECTRIFY YOUR DATA PROCESSING Michael Nitschinger, Couchbase

Upload: couchbase

Post on 11-Aug-2015



TRANSCRIPT

Page 1: Spark with Couchbase to Electrify Your Data Processing: Couchbase Connect 2015

SPARK WITH COUCHBASE TO ELECTRIFY YOUR DATA PROCESSING

Michael Nitschinger, Couchbase


What is Spark?


©2015 Couchbase Inc. 3

Introduction

Apache Spark is a fast and general engine for large-scale data processing.


More Facts

Over 450 contributors, very active Apache Big Data project.

Huge public interest:

Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q


Community

Ecosystem growing fast: Hadoop, RDBMS, NoSQL

Package repository: http://spark-packages.org/ (connectors, utility libraries)


Components: Spark Core

Resilient Distributed Datasets

Clustering

Execution


Components: Spark SQL

Structured Data Frames

Distributed querying with SQL


Components: Spark Streaming

Fault-tolerant streaming applications


Components: Spark MLlib

Built-In Machine Learning Algorithms


Components: Spark GraphX

Graph processing and graph-parallel computations


How does it work?

Resilient Distributed Datasets paper:

https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

rdd1.join(rdd2).groupBy(…).filter(…)

RDD Objects: build DAG

DAGScheduler (receives the DAG): split graph into stages of tasks; submit each stage as ready; agnostic to operators!

TaskScheduler (receives each TaskSet): launch tasks via cluster manager; retry failed or straggling tasks; doesn't know about stages; reports a failed stage back to the DAGScheduler

Worker (receives each Task): execute tasks in threads; store and serve blocks via the block manager
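The "split graph into stages of tasks" step can be pictured with a small toy model. The sketch below is plain Python, not Spark's API: the function name `split_into_stages`, the set `WIDE_OPS`, and the restriction to a linear chain of operations are all assumptions made for illustration. It cuts a pipeline wherever an operation is assumed to need a shuffle.

```python
# Toy model only: these names are illustrative and are not part of Spark.
# Operations assumed to need a shuffle (data moves between partitions).
WIDE_OPS = {"join", "groupBy"}


def split_into_stages(ops):
    """Cut a linear chain of operations into stages at shuffle boundaries."""
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            # A wide operation reads shuffled data, so it opens a new stage.
            stages.append([])
        stages[-1].append(op)
    return stages


pipeline = ["map", "join", "groupBy", "filter"]
for index, stage in enumerate(split_into_stages(pipeline)):
    print(f"stage {index}: {stage}")
```

Narrow operations such as `filter` stay in the stage of the operation before them, which is why the last stage here holds both `groupBy` and `filter`. A real scheduler works on a graph with several parents per node, so treat this only as an intuition aid.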


Why should you care?


Spark Benefits

Linearly scalable to 1000+ worker nodes

Simpler to use than Hadoop MR

Only partial recompute on failure

For developers and data scientists: machine learning, R integration

Tight but not mandatory Hadoop integration: sources, sinks, scheduler
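"Only partial recompute on failure" follows from each dataset remembering how it was derived. The sketch below is a pure-Python toy, not Spark's API: the class `ToyRDD`, its always-on per-partition cache, and the `computations` counter are assumptions made for illustration. It shows that after one partition is lost, only that partition is recomputed from its parent.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it remembers its parent and function (lineage)."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.source = partitions   # only set on the root dataset
        self.parent = parent
        self.fn = fn
        self.cache = {}            # partition index -> computed records
        self.computations = 0      # number of partitions (re)computed so far

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn)

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute(self, index):
        if index in self.cache:
            return self.cache[index]
        if self.parent is None:
            records = list(self.source[index])
        else:
            # Rebuild this partition from the matching parent partition.
            records = [self.fn(x) for x in self.parent.compute(index)]
        self.computations += 1
        self.cache[index] = records
        return records

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


base = ToyRDD(partitions=[[1, 2], [3, 4], [5, 6]])
doubled = base.map(lambda x: x * 2)

first = doubled.collect()      # computes all three partitions
del doubled.cache[1]           # simulate losing one partition on a failed worker
second = doubled.collect()     # recomputes partition 1 only

print(second)                  # [2, 4, 6, 8, 10, 12]
print(doubled.computations)    # 4: three initial computations plus one recompute
```

Real Spark does not cache every partition by default, and a lost partition behind a shuffle may need more than one parent partition; the toy keeps to a single narrow `map` to isolate the idea.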


Spark vs Hadoop

Spark is RAM bound, while Hadoop is mainly HDFS (disk) bound

Fully compatible with Hadoop Input/Output

Easier to develop against thanks to functional composition

Hadoop certainly more mature, but Spark ecosystem growing fast


Ecosystem Flexibility

RDBMS

Streams

Web APIs

DCP, KV, N1QL, Views

Batching

Data Archive

OLTP Data


Infrastructure Consolidation


The Couchbase Spark Connector


Couchbase Connector

Spark Core: Automatic Cluster and Resource Management; Creating and Persisting RDDs; Java APIs in addition to Scala (planned before GA)

Spark SQL: Easy JSON handling and querying; Tight N1QL Integration (partially in dp2, fully planned before GA)

Spark Streaming: Persisting DStreams; DCP source (partially in dp2, fully planned before GA)


Facts

Current Version: 1.0.0-dp2

Beta in July, GA in Q3 (tentative)

Code: https://github.com/couchbaselabs/couchbase-spark-connector

Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki


Connection Management



Creating RDDs


Persisting RDDs


Spark SQL Integration


Spark Streaming with DCP


Questions?


Thank you.