Feeding Cassandra with Spark Streaming & Kafka
Cary Bourgeois, Solutions Engineer, DataStax, Central Region


TRANSCRIPT

Page 1: Feeding Cassandra with Spark-Streaming and Kafka

Feeding Cassandra with Spark Streaming & Kafka
Cary Bourgeois, Solutions Engineer, DataStax, Central Region

Page 2: Feeding Cassandra with Spark-Streaming and Kafka

Who Am I

• DataStax < 2 years
• Not a “developer”
• Legacy BI/Database
  • Business Objects
  • SAP
• Demo development
  • R
  • Java (if I have to)
  • Scala (someday)

Page 3: Feeding Cassandra with Spark-Streaming and Kafka


Cassandra Summit 2015 September 22-24, Santa Clara Convention Center

7,000 Attendees

Page 4: Feeding Cassandra with Spark-Streaming and Kafka

Last Week - Mission Impossible? A stretch, but possible.


Page 5: Feeding Cassandra with Spark-Streaming and Kafka

Sunday Afternoon - I’m getting my A$$ kicked


Page 6: Feeding Cassandra with Spark-Streaming and Kafka

Monday Afternoon - Arghhhhh!


Page 7: Feeding Cassandra with Spark-Streaming and Kafka

Monday Night - I got this!


Page 8: Feeding Cassandra with Spark-Streaming and Kafka


Capture Raw Data

Analyze & ∑ummarize

Page 9: Feeding Cassandra with Spark-Streaming and Kafka

Why Mess with Success?

• Spark 1.3+
  • New/improved Kafka support
  • DataFrames
• DataStax Enterprise 4.8
  • Spark 1.4 support

https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

Page 10: Feeding Cassandra with Spark-Streaming and Kafka

Why Mess with Success?

https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

Page 11: Feeding Cassandra with Spark-Streaming and Kafka

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

Fast: A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.

Scalable: Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow streams larger than the capability of any single machine and to allow clusters of coordinated consumers.

Durable: Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.

Distributed by Design: Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.

Page 12: Feeding Cassandra with Spark-Streaming and Kafka

• Producers
• Consumers
• Persistence
• Topics
• Partitions
• Replication


http://kafka.apache.org/documentation.html

Page 13: Feeding Cassandra with Spark-Streaming and Kafka

• Create a Kafka topic:
bin/kafka-topics.sh --zookeeper localhost:2181 --create --replication-factor 1 --partitions 1 --topic stream_ts

• List all topics:
bin/kafka-topics.sh --zookeeper localhost:2181 --list

• Monitor a topic:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic stream_ts --from-beginning
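To smoke-test the topic end-to-end before writing any application code, Kafka's bundled console producer pairs nicely with the console consumer above (assuming a broker on the default localhost:9092):

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic stream_ts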


Page 14: Feeding Cassandra with Spark-Streaming and Kafka


Kafka and the Producer


Page 15: Feeding Cassandra with Spark-Streaming and Kafka

The Producer App

• Lots of options
• I chose
  • Scala (not steep enough)
  • Akka
• Producing this message:

Edge 1;1;401843;2015-11-04 06:23:49.001;64.44286233060423;82.79653847181152
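For the shape of it without opening the repo, here is a minimal sketch of a producer that emits that message. It uses the plain Kafka producer API rather than the Akka pipeline the actual demo uses, and the broker address and stream_ts topic are assumptions carried over from the earlier slides:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SimpleProducer extends App {
  // Assumed broker address; the demo's Akka producer is in the repo linked at the end
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)

  // One semicolon-delimited reading: edge_id;sensor;epoch_hr;ts;depth;value
  val msg = "Edge 1;1;401843;2015-11-04 06:23:49.001;64.44286233060423;82.79653847181152"
  producer.send(new ProducerRecord[String, String]("stream_ts", msg))
  producer.close()
}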

Page 16: Feeding Cassandra with Spark-Streaming and Kafka

Destination - Cassandra Tables


CREATE TABLE demo.data (
  edge_id text,
  sensor text,
  epoch_hr text,
  ts timestamp,
  depth double,
  value double,
  PRIMARY KEY ((edge_id, sensor, epoch_hr), ts)
);

CREATE TABLE demo.last (
  edge_id text,
  sensor text,
  ts timestamp,
  depth double,
  value double,
  PRIMARY KEY ((edge_id, sensor))
);

CREATE TABLE demo.count (
  pk int,
  ts timestamp,
  count bigint,
  count_ma double,
  PRIMARY KEY (pk, ts)
);
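Note how the fields of the produced message map straight onto demo.data's partition key (edge_id, sensor, epoch_hr), so each sensor's readings land one hour per partition. A sketch of reading one such bucket back with the spark-cassandra-connector, using the values from the sample message (assumes an existing SparkContext sc with the connector on the classpath):

import com.datastax.spark.connector._

// Pull one hour-bucket of raw readings for a single sensor;
// the key values here come from the sample message above
val hourOfReadings = sc.cassandraTable("demo", "data")
  .where("edge_id = ? AND sensor = ? AND epoch_hr = ?", "Edge 1", "1", "401843")
hourOfReadings.collect().foreach(println)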

Page 17: Feeding Cassandra with Spark-Streaming and Kafka

DSE Analytics => Spark

• No ETL
• Spark 1.4.1 certification
• Simplified map and reduce
• Very developer friendly
  • SparkSQL
  • Spark Streaming
  • Machine Learning
• DSE Analytics and Search integration
• Cassandra benefits (scaling, availability)

“I want to do processing on data before it hits Cassandra.” “I need my sums, avgs, group by’s, etc.” “I want to run real-time analytics on my Cassandra data.”
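One way those asks look in practice is SparkSQL over the Cassandra tables. A minimal sketch using the CassandraSQLContext that ships with the Spark-Cassandra connector of this era (assumes an existing SparkContext sc; the query targets the demo.data table defined earlier):

import org.apache.spark.sql.cassandra.CassandraSQLContext

// SQL over Cassandra tables: sums, averages, group-bys
val cc = new CassandraSQLContext(sc)
val avgBySensor = cc.sql(
  "SELECT edge_id, sensor, avg(value) FROM demo.data GROUP BY edge_id, sensor")
avgBySensor.show()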

Page 18: Feeding Cassandra with Spark-Streaming and Kafka

Processing the Stream

• Simple Scala job (sketched below)
• Deal with the raw flow
  • Capture the raw data
  • Capture the latest sensor reading
• Summarize and analyze
  • Windowing the stream
  • Count records every x seconds
  • Calculate a moving average of every x seconds over a number of periods
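A minimal sketch of such a job, using the Spark 1.3+ direct Kafka stream and the spark-cassandra-connector. The topic, broker address, window lengths, and field parsing are assumptions carried over from earlier slides; the real job lives in the repo linked at the end:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

// Assumes an existing SparkContext `sc`
val ssc = new StreamingContext(sc, Seconds(1))

// Direct (receiver-less) Kafka stream, new in Spark 1.3
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "localhost:9092"), Set("stream_ts"))

// Parse "edge_id;sensor;epoch_hr;ts;depth;value"
val readings = stream.map(_._2.split(";")).map(f =>
  (f(0), f(1), f(2), java.sql.Timestamp.valueOf(f(3)), f(4).toDouble, f(5).toDouble))

// Capture the raw data (demo.last is written the same way, minus epoch_hr)
readings.saveToCassandra("demo", "data",
  SomeColumns("edge_id", "sensor", "epoch_hr", "ts", "depth", "value"))

// Count records every 10 seconds over a sliding 30-second window;
// the moving average column of demo.count is left to the full job
readings.window(Seconds(30), Seconds(10)).count()
  .map(c => (1, new java.util.Date(), c))
  .saveToCassandra("demo", "count", SomeColumns("pk", "ts", "count"))

ssc.start()
ssc.awaitTermination()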

Page 19: Feeding Cassandra with Spark-Streaming and Kafka


Full Demo


Page 20: Feeding Cassandra with Spark-Streaming and Kafka

Next Steps

• SparkR
• MLlib workflows
• Notebooks
  • Spark
  • Jupyter

Page 21: Feeding Cassandra with Spark-Streaming and Kafka

If you would like the code:


https://github.com/CaryBourgeois/KafkaSparkCassandraDemo