
Post on 22-Aug-2015


Page 1: Couchbase Live Europe 2015: Big Data Analytics with Couchbase including Hadoop, Kafka, Spark and More

Big Data Analytics with CouchbaseHadoop, Kafka, Spark and More

Matt Ingenthron, Sr. Director

Michael Nitschinger, Software Engineer

Agenda

• Define Problem Domain
• How Couchbase fits in
• Demo
• Q&A

Lambda Architecture

[Diagram: numbered flow (1-5) across the DATA, BATCH, SPEED, SERVE and QUERY layers]

Lambda Architecture

Interactive and Real Time Applications

[Diagram: numbered flow (1-5) across the DATA, BATCH, SPEED, SERVE and QUERY layers, annotated with the components HADOOP, COUCHBASE, STORM, COUCHBASE, Broker Cluster, Spout for Topic, Kafka Producers, Ordered Subscriptions]
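
As a rough illustration of how the layers in the Lambda Architecture diagram relate, the following minimal Python sketch precomputes a batch view over the stored data, keeps a speed-layer view for events that arrived after the last batch run, and merges the two at query time. The event data, the function names and the page-count metric are all hypothetical; on the slide these roles are played by Hadoop, Storm and Couchbase rather than by in-process counters.

```python
from collections import Counter

# Hypothetical events as (timestamp, page) pairs; not taken from the talk.
MASTER_DATA = [(1, "home"), (2, "cart"), (3, "home"), (4, "home"), (5, "cart")]


def batch_view(events, up_to):
    """Batch layer: recompute counts from all events with timestamp <= up_to."""
    return Counter(page for ts, page in events if ts <= up_to)


def speed_view(events, after):
    """Speed layer: count only the events newer than the last batch run."""
    return Counter(page for ts, page in events if ts > after)


def query(batch, speed, page):
    """Serve/query step: merge the precomputed counts with the recent ones."""
    return batch[page] + speed[page]


last_batch_run = 3  # assumption: the batch job last covered timestamps 1..3
batch = batch_view(MASTER_DATA, last_batch_run)
speed = speed_view(MASTER_DATA, last_batch_run)

print(query(batch, speed, "home"))  # 3
print(query(batch, speed, "cart"))  # 2
```

The point of the split is that the batch view can be slow to rebuild but complete, while the speed view only has to cover the gap since the last rebuild.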

[Diagram labels: COMPLEX EVENT PROCESSING; Real Time REPOSITORY; PERPETUAL STORE; ANALYTICAL DB; BUSINESS INTELLIGENCE; MONITORING; CHAT/VOICE SYSTEM; BATCH TRACK; REAL-TIME TRACK; DASHBOARD]

[Diagram labels: TRACKING and COLLECTION; ANALYSIS AND VISUALIZATION; REST FILTER METRICS]


Integration at Scale

Requirements for data streaming in modern systems…

• Must support high throughput and low latency
• Need to handle failures
• Pick up where you left off
• Be efficient about resource usage
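
The "pick up where you left off" requirement can be sketched as a consumer that records the sequence number of the last item it handled and resumes from that point after a failure. This is a minimal in-memory illustration: the `Stream` and `Consumer` classes and the checkpoint dictionary are hypothetical names, not the DCP or Kafka API, and a real system would keep the checkpoint in durable storage.

```python
class Stream:
    """Ordered, replayable stream: every item has a sequence number."""

    def __init__(self, items):
        self._items = list(items)

    def read_from(self, seqno):
        """Yield (seqno, item) pairs starting at the given sequence number."""
        for i in range(seqno, len(self._items)):
            yield i, self._items[i]


class Consumer:
    def __init__(self, stream, checkpoint):
        self.stream = stream
        self.checkpoint = checkpoint  # assumption: durable in a real system
        self.processed = []

    def run(self, fail_at=None):
        start = self.checkpoint.get("next_seqno", 0)
        for seqno, item in self.stream.read_from(start):
            if seqno == fail_at:
                raise RuntimeError("simulated crash")
            self.processed.append(item)
            # Record progress only after the item has been handled.
            self.checkpoint["next_seqno"] = seqno + 1


stream = Stream(["a", "b", "c", "d"])
checkpoint = {}

first = Consumer(stream, checkpoint)
try:
    first.run(fail_at=2)  # crashes before handling "c"
except RuntimeError:
    pass

second = Consumer(stream, checkpoint)  # the restarted process
second.run()

print(first.processed)   # ['a', 'b']
print(second.processed)  # ['c', 'd']
```

Because the restarted consumer reads only from the checkpoint onward, it neither repeats the work already done nor rescans the whole stream, which also speaks to the resource-usage requirement.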

Data Sync is the Heart of Any Big Data System

Fundamental piece of the architecture
- Data Sync maintains Data Redundancy for High Availability (HA) & Disaster Recovery (DR)
  - Protect against failures – node, rack, region etc.
- Data Sync maintains Indexes
  - Indexing is key to building faster access paths to query data
  - Spatial, Full-text

DCP and Couchbase Server Architecture

What is DCP?

DCP is an innovative protocol that drives data sync for Couchbase Server

• Increase data sync efficiency with massive data footprints
• Remove slower Disk-IO from the data sync path
• Improve latencies – replication for data durability
• In future, will provide a programmable data sync protocol for external stores outside Couchbase Server

DCP powers many critical components


Demo

Shopper Tracking (click stream)

Lightweight Analytics:
• Department shopped
• Tech platform
• Click tracks by Income

Heavier Analytics, Develop Profiles
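
The lightweight analytics named on this slide amount to simple groupings over click events. A minimal sketch follows, assuming each click is a small record; the field names (`department`, `platform`, `income`) and the sample values are hypothetical and do not come from the demo.

```python
from collections import Counter, defaultdict

# Hypothetical click-stream events; the schema is an assumption.
clicks = [
    {"user": "u1", "department": "shoes", "platform": "ios", "income": "high"},
    {"user": "u2", "department": "toys", "platform": "android", "income": "low"},
    {"user": "u1", "department": "toys", "platform": "ios", "income": "high"},
    {"user": "u3", "department": "shoes", "platform": "web", "income": "low"},
]

# Department shopped: number of clicks per department.
by_department = Counter(c["department"] for c in clicks)

# Tech platform: number of clicks per client platform.
by_platform = Counter(c["platform"] for c in clicks)

# Click tracks by income: ordered list of departments visited per income band.
tracks_by_income = defaultdict(list)
for c in clicks:
    tracks_by_income[c["income"]].append(c["department"])

print(by_department["shoes"])       # 2
print(by_platform.most_common(1))   # [('ios', 2)]
print(tracks_by_income["high"])     # ['shoes', 'toys']
```

Counts like these are cheap enough to maintain as events arrive, whereas building per-shopper profiles needs the full history and fits the heavier, batch-oriented path.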


And at scale…


Couchbase & Apache Spark

Introduction & Integration


What is Spark?

Apache Spark is a fast and general engine for large-scale data processing.

Spark Components

• Spark Core: RDDs, Clustering, Execution, Fault Management
• Spark SQL: Work with structured data, distributed SQL querying
• Spark Streaming: Build fault-tolerant streaming applications
• MLlib: Machine Learning built in
• GraphX: Graph processing and graph-parallel computations

Spark Benefits

• Linear Scalability
• Ease of Use
• Fault Tolerance
• For developers and data scientists
• Tight but not mandatory Hadoop integration

Spark Facts

• Current Release: 1.3.0
• Over 450 contributors, most active Apache Big Data project.
• Huge public interest:

[Chart: Google Trends search interest, "apache spark" vs. "apache hadoop"]

Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q

Daytona GraySort Performance

                           Hadoop MR Record        Spark Record
Data Size                  102.5 TB                100 TB
Elapsed Time               72 mins                 23 mins
# Nodes                    2100                    206
# Cores                    50400 physical          6592 virtual
Cluster Disk Throughput    3150 GB/s               618 GB/s
Network                    Dedicated DC, 10Gbps    EC2, 10Gbps
Sort Rate                  1.42 TB/min             4.27 TB/min
Sort Rate/Node             0.67 GB/min             20.7 GB/min

Source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Benchmark: http://sortbenchmark.org/
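
The derived rows of the GraySort table can be checked with a little arithmetic. The sketch below assumes 1 TB = 1000 GB and uses only the figures quoted on the slide; the function name is made up for the example.

```python
def sort_rate_per_node(rate_tb_per_min, nodes):
    """Per-node sort rate in GB/min, assuming 1 TB = 1000 GB."""
    return rate_tb_per_min * 1000 / nodes


# Sort rate = data size / elapsed time, using the slide's Hadoop figures.
hadoop_rate = 102.5 / 72
print(round(hadoop_rate, 2))  # 1.42

# Per-node rates computed from the quoted sort rates and node counts.
hadoop_per_node = sort_rate_per_node(1.42, 2100)
spark_per_node = sort_rate_per_node(4.27, 206)
print(round(hadoop_per_node, 2))  # 0.68
print(round(spark_per_node, 1))   # 20.7

# Elapsed time implied by the quoted Spark sort rate.
print(round(100 / 4.27, 1))  # 23.4

# Per-node advantage of the Spark run: roughly 30x.
print(spark_per_node / hadoop_per_node)
```

Two small observations follow from the numbers: the Hadoop per-node figure of 0.67 GB/min on the slide appears to be truncated rather than rounded, and the Spark sort rate of 4.27 TB/min is consistent with an elapsed time slightly above the 23 minutes shown.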

How does it work?

Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

[Diagram stages: RDD Creation, Scheduling DAG, Task Execution]
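
The three stages named on this slide can be illustrated with a toy, single-process sketch: creating a dataset, recording transformations lazily as a chain of steps (a linear DAG in this simplified case), and running one task per partition only when an action is called. `ToyRDD` is a hypothetical class written for this example; it is not the Spark API and leaves out distribution, shuffles and caching.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions plus a recorded chain of steps."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions
        self.lineage = lineage  # tuple of ("map" | "filter", function)

    # Transformations are lazy: they only extend the recorded lineage.
    def map(self, fn):
        return ToyRDD(self.partitions, self.lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.partitions, self.lineage + (("filter", fn),))

    def _run_task(self, partition):
        """Task execution: apply the whole chain to a single partition."""
        data = list(partition)
        for kind, fn in self.lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    # Action: run one task per partition and gather the results.
    def collect(self):
        out = []
        for partition in self.partitions:
            out.extend(self._run_task(partition))
        return out


rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])  # creation, with two partitions
evens_squared = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(len(evens_squared.lineage))  # 2 steps recorded, nothing computed yet
print(evens_squared.collect())     # [4, 16, 36]
```

Keeping the lineage rather than the intermediate data is what makes recovery cheap in this model: a lost partition can be rebuilt by re-running its task from the source partition.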



Spark vs Hadoop

• Spark is RAM bound, while Hadoop is HDFS (disk) bound

• API easier to reason about & to develop against

• Fully compatible with Hadoop Input/Output formats

• Hadoop more mature, Spark ecosystem growing fast

Ecosystem Flexibility

[Diagram labels: RDBMS; Streams; Web APIs; DCP; KV; N1QL; Views; Batching; Archived Data; OLTP]

Infrastructure Consolidation

[Diagram labels: Streams; Web APIs; User Interaction]

Couchbase Connector

Spark Core
• Automatic Cluster and Resource Management
• Creating and Persisting RDDs
• Java APIs in addition to Scala (planned before GA)

Spark SQL
• Easy JSON handling and querying
• Tight N1QL Integration (dp2)

Spark Streaming
• Persisting DStreams
• DCP source (planned before GA)

Connector Facts

• Current Version: 1.0.0-dp
• DP2 upcoming
• GA planned for Q3

Code: https://github.com/couchbaselabs/couchbase-spark-connector

Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki


Questions


Matt Ingenthron @ingenthr

Michael Nitschinger @daschl

Thanks


Additional Slides

Use Case at LinkedIn

Michael Kehoe

• Site Reliability Engineer (SRE) at LinkedIn
• SRE for Profile & Higher-Education
• Member of LinkedIn’s CBVT
• B.E. (Electrical Engineering) from the University of Queensland, Australia

Kafka @ LinkedIn

• Kafka was created by LinkedIn
• Kafka is a publish-subscribe system built as a distributed commit log
• Processes 500+ TB/day (~500 billion messages) @ LinkedIn
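
To make the "publish-subscribe system built as a distributed commit log" idea concrete, here is a minimal single-process sketch: producers append to an ordered log, and each subscriber keeps its own read offset, so several consumers can read the same records independently. The `CommitLog` and `Subscriber` classes are hypothetical illustrations, not Kafka's client API, and the sketch omits partitioning, replication and persistence.

```python
class CommitLog:
    """Append-only log; a record's offset is its position in the log."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=10):
        return self._records[offset:offset + max_records]


class Subscriber:
    """Each subscriber tracks its own offset, so readers are independent."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = CommitLog()
for event in ["page_view", "click", "page_view"]:
    log.append(event)

analytics = Subscriber(log)
monitoring = Subscriber(log)

print(analytics.poll(2))  # ['page_view', 'click']
print(monitoring.poll())  # ['page_view', 'click', 'page_view']
print(analytics.poll())   # ['page_view']
```

As a side note on the volume quoted on the slide, 500 TB spread over roughly 500 billion messages works out to about 1 KB per message on average.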

LinkedIn’s uses of Kafka

• Monitoring
  • InGraphs
• Traditional Messaging (Pub-Sub)
• Analytics
  • Who Viewed my Profile
  • Experiment reports
  • Executive reports
• Building block for (log) distributed applications
  • Pinot
  • Espresso


Use Case: Kafka to Hadoop (Analytics)

• LinkedIn tracks data to better understand how members use our products

• Information such as which page got viewed and which content got clicked on is sent into a Kafka cluster in each data center

• Some of these events are centrally collected and pushed onto our Hadoop grid for analysis and daily report generation


Couchbase @ LinkedIn

• About 25 separate services with one or more clusters in multiple data centers

• Up to 100 servers in a cluster

• Single and Multi-tenant clusters


Use Case: Jobs Cluster

• Read scaling, Couchbase ~80k QPS, 24 server cluster(s)
• Hadoop to pre-build data by partition
• Couchbase 99th percentile latencies


Hadoop to Couchbase

• Our primary use-case for Hadoop to Couchbase is building (warming) / recovering Couchbase buckets

• LinkedIn built its own in-house solution to work with our ETL processes, cache invalidation procedures, etc.