Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
![Page 1: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/1.jpg)
Streaming Analytics with Spark, Kafka, Cassandra, and Akka
Helena Edelson VP of Product Engineering @Tuplejump
![Page 2: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/2.jpg)
• Committer / Contributor: Akka, FiloDB, Spark Cassandra Connector, Spring Integration
• VP of Product Engineering @Tuplejump
• Previously: Sr Cloud Engineer / Architect at VMware, CrowdStrike, DataStax and SpringSource
[email protected]/helena
![Page 3: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/3.jpg)
Tuplejump Open Source github.com/tuplejump
• FiloDB - distributed, versioned, columnar analytical db for modern streaming workloads
• Calliope - the first Spark-Cassandra integration
• Stargate - Lucene indexer for Cassandra
• SnackFS - HDFS-compatible file system for Cassandra
![Page 4: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/4.jpg)
What Will We Talk About
• The Problem Domain
• Example Use Case
• Rethinking Architecture
– We don't have to look far to look back
– Streaming
– Revisiting the goal and the stack
– Simplification
![Page 5: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/5.jpg)
THE PROBLEM DOMAIN
Delivering Meaning From A Flood Of Data
![Page 6: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/6.jpg)
The Problem Domain
Need to build scalable, fault tolerant, distributed data processing systems that can handle massive amounts of data from disparate sources, with different data structures.
![Page 7: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/7.jpg)
Translation
How to build adaptable, elegant systems for complex analytics and learning tasks to run as large-scale clustered dataflows
![Page 8: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/8.jpg)
How Much Data
Yottabyte = quadrillion gigabytes or septillion bytes
We all have a lot of data • Terabytes • Petabytes...
https://en.wikipedia.org/wiki/Yottabyte
![Page 9: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/9.jpg)
Delivering Meaning
• Deliver meaning in sec/sub-sec latency
• Disparate data sources & schemas
• Billions of events per second
• High-latency batch processing
• Low-latency stream processing
• Aggregation of historical data from the stream
![Page 10: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/10.jpg)
While We Monitor, Predict & Proactively Handle
• Massive event spikes
• Bursty traffic
• Fast producers / slow consumers
• Network partitioning & out-of-sync systems
• DC down
• Wait, we've DDoS'd ourselves from fast streams?
• Autoscale issues
– When we scale down VMs, how do we not lose data?
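One common way to cope with the "fast producers / slow consumers" item above is to put a bounded buffer between the two, so overload becomes visible instead of exhausting memory. The sketch below is only an illustration of that idea using the JDK's `ArrayBlockingQueue`; the `BoundedBuffer` object and `offerAll` helper are hypothetical names, and the slides do not prescribe this particular strategy.

```scala
import java.util.concurrent.ArrayBlockingQueue

// Hypothetical helper, not taken from the talk.
object BoundedBuffer {
  // Offer each event to a bounded queue without blocking.
  // Returns how many events were rejected because the queue was full,
  // which is the signal a producer could use to slow down or shed load.
  def offerAll[A](queue: ArrayBlockingQueue[A], events: Seq[A]): Int =
    events.count(e => !queue.offer(e))
}
```

A producer that sees a non-zero rejection count can back off, drop, or spill to durable storage, depending on how much data loss is acceptable.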
![Page 11: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/11.jpg)
And stay within our AWS / Rackspace budget
![Page 12: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/12.jpg)
EXAMPLE CASE: CYBER SECURITY
Hunting The Hunter
![Page 13: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/13.jpg)
Adversary Profiling & Hunting
• Track activities of international threat actor groups: nation-state, criminal or hacktivist
• Intrusion attempts
• Actual breaches
• Profile adversary activity
• Analysis to understand their motives, anticipate actions and prevent damage
![Page 14: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/14.jpg)
Stream Processing
• Machine events
• Endpoint intrusion detection
• Anomalies/indicators of attack or compromise
• Machine learning
• Training models based on patterns from historical data
• Predict potential threats
• Profiling for adversary identification
![Page 15: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/15.jpg)
Data Requirements & Description
• Streaming event data
• Log messages
• User activity records
• System ops & metrics data
• Disparate data sources
• Wildly differing data structures
![Page 16: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/16.jpg)
Massive Amounts Of Data
• One machine can generate 2+ TB per day
• Tracking millions of devices
• 1 million writes per second - bursty
• High % writes, lower % reads
• TTL
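To put the figures above in perspective, here is a small back-of-the-envelope calculation. It assumes the write rate is sustained for a full day and uses decimal terabytes (10^12 bytes); both are simplifying assumptions, and the `IngestMath` object is a hypothetical name used only for this illustration.

```scala
// Hypothetical helper for rough capacity arithmetic, not taken from the talk.
object IngestMath {
  val SecondsPerDay: Long = 24L * 60 * 60 // 86,400

  // Writes per day if a given per-second rate were sustained all day.
  def writesPerDay(writesPerSecond: Long): Long =
    writesPerSecond * SecondsPerDay

  // Average bytes per second for a machine emitting `tbPerDay` decimal terabytes a day.
  def bytesPerSecond(tbPerDay: Double): Double =
    tbPerDay * 1e12 / SecondsPerDay

  def main(args: Array[String]): Unit = {
    println(writesPerDay(1000000L))    // 86400000000, i.e. 86.4 billion writes per day
    println(bytesPerSecond(2.0) / 1e6) // roughly 23.1 MB per second per machine
  }
}
```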
![Page 17: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/17.jpg)
RETHINKING ARCHITECTURE
![Page 18: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/18.jpg)
WE DON'T HAVE TO LOOK FAR TO LOOK BACK
Rethinking Architecture
![Page 19: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/19.jpg)
Most batch analytics flows from several years ago looked like...
![Page 20: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/20.jpg)
STREAMING & DATA SCIENCE
Rethinking Architecture
![Page 21: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/21.jpg)
Streaming
I need fast access to historical data on the fly for predictive modeling with real time data from the stream.
![Page 22: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/22.jpg)
Not A Stream, A Flood
• Data emitters
• Netflix: 1 - 2 million events per second at peak
• 750 billion events per day
• LinkedIn: > 500 billion events per day
• Data ingesters
• Netflix: 50 - 100 billion events per day
• LinkedIn: 2.5 trillion events per day
• 1 Petabyte of streaming data
![Page 23: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/23.jpg)
Which Translates To
• Do it fast
• Do it cheap
• Do it at scale
![Page 24: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/24.jpg)
Challenges
• Code changes at runtime
• Distributed Data Consistency
• Ordering guarantees
• Complex compute algorithms
![Page 25: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/25.jpg)
Oh, and don't lose data
![Page 26: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/26.jpg)
Strategies
• Partition For Scale & Data Locality
• Replicate For Resiliency
• Share Nothing
• Fault Tolerance
• Asynchrony
• Async Message Passing
• Memory Management
• Data lineage and reprocessing in runtime
• Parallelism
• Elastically Scale
• Isolation
• Location Transparency
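The first two strategies, partitioning and replication, can be sketched in a few lines of plain Scala. This is a deliberately simplified illustration that uses modulo hashing over a fixed number of partitions; production systems such as Cassandra use token ranges and more elaborate placement. `RingPartitioner` and its methods are hypothetical names.

```scala
// Hypothetical sketch of partitioning plus replication on a ring, not taken from the talk.
object RingPartitioner {
  // Map a key to its primary partition; floorMod keeps the result non-negative.
  def partitionFor(key: String, partitions: Int): Int =
    java.lang.Math.floorMod(key.hashCode, partitions)

  // The primary partition followed by the next partitions clockwise on the ring,
  // capped so a key is never assigned to the same partition twice.
  def replicasFor(key: String, partitions: Int, replicas: Int): Seq[Int] = {
    val primary = partitionFor(key, partitions)
    (0 until math.min(replicas, partitions)).map(i => (primary + i) % partitions)
  }
}
```

Because placement depends only on the key, any node can compute where a record lives without asking a coordinator, which is part of what makes a share-nothing design workable.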
![Page 27: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/27.jpg)
AND THEN WE GREEKED OUT
Rethinking Architecture
![Page 28: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/28.jpg)
Lambda Architecture
A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods.
![Page 29: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/29.jpg)
Lambda Architecture
A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods.
• An approach
• Coined by Nathan Marz
• This was a huge stride forward
![Page 30: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/30.jpg)
https://www.mapr.com/developercentral/lambda-architecture
![Page 31: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/31.jpg)
![Page 32: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/32.jpg)
Implementing Is Hard
• Real-time pipeline backed by KV store for updates
• Many moving parts - KV store, real time, batch
• Running similar code in two places
• Still ingesting data to Parquet/HDFS
• Reconcile queries against two different places
![Page 33: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/33.jpg)
Performance Tuning & Monitoring: on so many systems
Also hard
![Page 34: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/34.jpg)
Lambda Architecture
An immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel.
![Page 35: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/35.jpg)
WAIT, DUAL SYSTEMS?
Challenge Assumptions
![Page 36: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/36.jpg)
Which Translates To
• Performing analytical computations & queries in dual systems
• Implementing transformation logic twice
• Duplicate Code
• Spaghetti Architecture for Data Flows
• One Busy Network
![Page 37: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/37.jpg)
Why Dual Systems?
• Why is a separate batch system needed?
• Why support code, machines and running services of two analytics systems?
Counterproductive on some level?
![Page 38: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/38.jpg)
YES
• A unified system for streaming and batch • Real-time processing and reprocessing
• Code changes • Fault tolerance
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html - Jay Kreps
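The core of the unified-system argument is that the transformation logic is written once and applied both to the full history (reprocessing) and to data as it arrives. The plain-Scala sketch below illustrates that idea with ordinary collections standing in for a real engine; `UnifiedPipeline`, `Event`, and the method names are hypothetical and chosen only for this example.

```scala
// Hypothetical sketch: one piece of logic shared by batch and streaming paths.
object UnifiedPipeline {
  final case class Event(userId: String, bytes: Long)

  // The single piece of business logic: fold one event into per-user totals.
  def update(totals: Map[String, Long], e: Event): Map[String, Long] =
    totals.updated(e.userId, totals.getOrElse(e.userId, 0L) + e.bytes)

  // "Batch": reprocess the whole history in one pass.
  def batch(history: Seq[Event]): Map[String, Long] =
    history.foldLeft(Map.empty[String, Long])(update)

  // "Streaming": apply the same function to micro-batches as they arrive.
  def streaming(microBatches: Iterator[Seq[Event]]): Map[String, Long] =
    microBatches.foldLeft(Map.empty[String, Long])((acc, mb) => mb.foldLeft(acc)(update))
}
```

Since both paths call `update`, a code change is made in one place and reprocessing old data yields the same result as having streamed it.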
![Page 39: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/39.jpg)
ANOTHER ASSUMPTION: ETL
Challenge Assumptions
![Page 40: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/40.jpg)
Extract, Transform, Load (ETL)
"Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project."
http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm
![Page 41: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/41.jpg)
Extract, Transform, Load (ETL)
ETL involves
• Extraction of data from one system into another
• Transforming it
• Loading it into another system
![Page 42: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/42.jpg)
Extract, Transform, Load (ETL)
"Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project."
http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm
Also unnecessarily redundant and often typeless
![Page 43: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/43.jpg)
ETL
• Each ETL step can introduce errors and risk
• Can duplicate data after failover
• Tools can cost millions of dollars
• Decreases throughput
• Increased complexity
![Page 44: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/44.jpg)
ETL
• Writing intermediary files
• Parsing and re-parsing plain text
![Page 45: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/45.jpg)
And let's duplicate the pattern over all our DataCenters
![Page 46: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/46.jpg)
These are not the solutions you're looking for
![Page 47: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/47.jpg)
REVISITING THE GOAL & THE STACK
![Page 48: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/48.jpg)
Removing The 'E' in ETL
Thanks to technologies like Avro and Protobuf we don’t need the “E” in ETL. Instead of text dumps that you need to parse over multiple systems:
Scala & Avro (e.g.)
• Can work with binary data that remains strongly typed
• A return to strong typing in the big data ecosystem
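The difference between parsing text dumps and exchanging typed binary records can be shown with a small stand-in. The sketch below uses the JDK's `DataOutputStream` and `DataInputStream` in place of Avro or Protobuf, purely so the example has no dependencies; it is not how either library encodes data, and `TypedRecords` and `MachineEvent` are hypothetical names.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical sketch of a typed binary record; a stand-in for Avro/Protobuf, not their wire format.
object TypedRecords {
  final case class MachineEvent(hostId: Int, timestamp: Long, message: String)

  // Write each field with its native type in a fixed order; no text formatting involved.
  def encode(e: MachineEvent): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    out.writeInt(e.hostId)
    out.writeLong(e.timestamp)
    out.writeUTF(e.message)
    out.flush()
    buffer.toByteArray
  }

  // Read the fields back in the same order; the result is a typed value, not a string to re-parse.
  def decode(bytes: Array[Byte]): MachineEvent = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    MachineEvent(in.readInt(), in.readLong(), in.readUTF())
  }
}
```

Every system that reads the bytes gets an `Int`, a `Long`, and a `String` back, so there is no per-hop extraction step in which parsing errors can creep in.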
![Page 49: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/49.jpg)
Removing The 'L' in ETL
If data collection is backed by a distributed messaging system (e.g. Kafka) you can do real-time fanout of the ingested data to all consumers. No need to batch "load".
• From there each consumer can do their own transformations
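Fan-out works because the log is append-only and every consumer keeps its own read position. The in-memory sketch below illustrates that model in plain Scala; it is a toy, not Kafka's API, and `FanOutLog`, `Log`, `append`, and `poll` are hypothetical names.

```scala
import scala.collection.mutable

// Hypothetical in-memory model of log fan-out; not Kafka's API.
object FanOutLog {
  final class Log[A] {
    private val records = mutable.ArrayBuffer.empty[A]
    private val offsets = mutable.Map.empty[String, Int]

    // Producers only ever append.
    def append(record: A): Unit = records += record

    // Return everything this consumer has not yet seen and advance only its own offset.
    def poll(consumer: String): Seq[A] = {
      val from = offsets.getOrElse(consumer, 0)
      val batch = records.slice(from, records.length).toList
      offsets(consumer) = records.length
      batch
    }
  }
}
```

A consumer that starts late still sees the full history, and a slow consumer never holds back a fast one, which is why no separate batch "load" step is needed.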
![Page 50: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/50.jpg)
#NoMoreGreekLetterArchitectures
![Page 51: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/51.jpg)
NoETL
![Page 52: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/52.jpg)
My Nerdy Chart

| Strategy | Technologies |
|---|---|
| Scalable Infrastructure / Elastic | Spark, Cassandra, Kafka |
| Partition For Scale, Network Topology Aware | Cassandra, Spark, Kafka, Akka Cluster |
| Replicate For Resiliency | Spark, Cassandra, Akka Cluster all hash the node ring |
| Share Nothing, Masterless | Cassandra, Akka Cluster both Dynamo style |
| Fault Tolerance / No Single Point of Failure | Spark, Cassandra, Kafka |
| Replay From Any Point Of Failure | Spark, Cassandra, Kafka, Akka + Akka Persistence |
| Failure Detection | Cassandra, Spark, Akka, Kafka |
| Consensus & Gossip | Cassandra & Akka Cluster |
| Parallelism | Spark, Cassandra, Kafka, Akka |
| Asynchronous Data Passing | Kafka, Akka, Spark |
| Fast, Low Latency, Data Locality | Cassandra, Spark, Kafka |
| Location Transparency | Akka, Spark, Cassandra, Kafka |
![Page 53: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/53.jpg)
SMACK
• Scala/Spark
• Mesos
• Akka
• Cassandra
• Kafka
![Page 54: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/54.jpg)
Spark Streaming
![Page 55: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/55.jpg)
Spark Streaming
• One runtime for streaming and batch processing
• Join streaming and static data sets
• No code duplication
• Easy, flexible data ingestion from disparate sources to disparate sinks
• Easy to reconcile queries against multiple sources
• Easy integration of KV durable storage
![Page 56: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/56.jpg)
How do I merge historical data with data in the stream?
![Page 57: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/57.jpg)
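Join Streams With Static Data

    val ssc = new StreamingContext(conf, Milliseconds(500))
    ssc.checkpoint("checkpoint")

    // Static reference data, parsed by `func` into (Int, String) pairs
    val staticData: RDD[(Int, String)] =
      ssc.sparkContext.textFile("whyAreWeParsingFiles.txt").flatMap(func)

    // createStream yields (String, String) pairs; the keys are assumed here to be numeric ids
    KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> n))
      .map { case (key, value) => (key.toInt, value) }
      .transform { events => events.join(staticData) } // join each micro-batch with the static data
      .saveToCassandra(keyspace, table)

    ssc.start()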
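To make the join semantics concrete without a Spark cluster, here is a minimal plain-Scala sketch of what happens to one micro-batch: each incoming (key, event) pair is matched against static reference data, and pairs without a match are dropped, as in an inner join. `MicroBatchJoin` and `join` are hypothetical names, and a `Map` stands in for the static RDD, so it assumes unique keys on the static side.

```scala
// Hypothetical sketch of a per-micro-batch inner join using plain collections.
object MicroBatchJoin {
  // Pair every event in the batch with the static value that shares its key;
  // events whose key has no static counterpart are dropped.
  def join[K, A, B](batch: Seq[(K, A)], static: Map[K, B]): Seq[(K, (A, B))] =
    batch.flatMap { case (k, a) => static.get(k).map(b => (k, (a, b))) }
}
```

For example, joining the batch `Seq(1 -> "login", 3 -> "scan", 2 -> "logout")` with `Map(1 -> "alice", 2 -> "bob")` keeps the events for keys 1 and 2 and drops the one for key 3.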
![Page 58: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/58.jpg)
Spark MLlib
[Figure: machine learning workflow. Labels: Your Data, Extract Data To Analyze, Train your model to predict, Training Data, Feature Extraction, Model Training, Model Testing, Test Data]
![Page 59: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/59.jpg)
Spark Streaming & ML

    val ssc = new StreamingContext(conf, Milliseconds(500))
    val model = KMeans.train(dataset, ...) // learn offline

    // Score each incoming event with the model that was trained offline
    val stream = KafkaUtils
      .createStream(ssc, zkQuorum, group, ...)
      .map(event => model.predict(event.feature))
![Page 60: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/60.jpg)
Apache Mesos
Open-source cluster manager developed at UC Berkeley.
Abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.
![Page 61: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/61.jpg)
Akka
High performance concurrency framework for Scala and Java
• Fault Tolerance
• Asynchronous messaging and data processing
• Parallelization
• Location Transparency
• Local / Remote Routing
• Akka: Cluster / Persistence / Streams
![Page 62: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/62.jpg)
Akka Actors
A distribution and concurrency abstraction
• Compute Isolation
• Behavioral Context Switching
• No Exposed Internal State
• Event-based messaging
• Easy parallelism
• Configurable fault tolerance
![Page 63: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/63.jpg)
Akka Actor Hierarchy
http://www.slideshare.net/jboner/building-reactive-applications-with-akka-in-scala
![Page 64: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/64.jpg)

    import akka.actor._

    class NodeGuardianActor(args...) extends Actor {
      // A custom supervision policy can be supplied by overriding `supervisorStrategy`.

      // Child actors are created under, and supervised by, this guardian.
      val temperature = context.actorOf(
        Props(new TemperatureActor(args)), "temperature")

      val precipitation = context.actorOf(
        Props(new PrecipitationActor(args)), "precipitation")

      override def preStart(): Unit = { /* lifecycle hook: init */ }

      // Initial behavior: wait for initialization, then switch behavior.
      def receive: Actor.Receive = {
        case Initialized => context become initialized
      }

      def initialized: Actor.Receive = {
        case e: SomeEvent  => someFunc(e)
        case e: OtherEvent => otherFunc(e)
      }
    }
![Page 65: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/65.jpg)
Apache Cassandra
• Extremely Fast
• Extremely Scalable
• Multi-Region / Multi-Datacenter
• Always On
• No single point of failure
• Survive regional outages
• Easy to operate
• Automatic & configurable replication
![Page 66: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/66.jpg)
Apache Cassandra
• Very flexible data modeling (collections, user defined types) and changeable over time
• Perfect for ingestion of real time / machine data
• Huge community
![Page 67: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/67.jpg)
Spark Cassandra Connector
• NOSQL JOINS!
• Write & Read data between Spark and Cassandra
• Compatible with Spark 1.4
• Handles Data Locality for Speed
• Implicit type conversions
• Server-Side Filtering - SELECT, WHERE, etc.
• Natural Timeseries Integration
http://github.com/datastax/spark-cassandra-connector
![Page 68: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/68.jpg)
KillrWeather
http://github.com/killrweather/killrweather
A reference application showing how to easily integrate streaming and batch data processing with Apache Spark Streaming, Apache Cassandra, Apache Kafka and Akka for fast, streaming computations on time series data in asynchronous event-driven environments.
http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/databricks/apps/weather
![Page 69: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/69.jpg)
Apache Kafka
• High Throughput Distributed Messaging
• Decouples Data Pipelines
• Handles Massive Data Load
• Support Massive Number of Consumers
• Distribution & partitioning across cluster nodes
• Automatic recovery from broker failures
![Page 70: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/70.jpg)
Spark Streaming & Kafka

    val context = new StreamingContext(conf, Seconds(1))

    val wordCount = KafkaUtils.createStream(context, ...)
      .map(_._2)               // keep the message value, drop the Kafka key
      .flatMap(_.split(" "))
      .map(x => (x, 1))
      .reduceByKey(_ + _)

    wordCount.saveToCassandra(ks, table)
    context.start() // start receiving and computing
![Page 71: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/71.jpg)

    class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
      extends AggregationActor {
      import settings._

      // Read raw CSV lines from Kafka and convert each one to RawWeatherData.
      val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
          ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
        .map(_._2.split(","))
        .map(RawWeatherData(_))

      kafkaStream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

      /** RawWeatherData: wsid, year, month, day, oneHourPrecip */
      kafkaStream.map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
        .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)

      /** Now the [[StreamingContext]] can be started. */
      context.parent ! OutputStreamInitialized

      def receive : Actor.Receive = {…}
    }

Gets the partition key: Data Locality. Spark C* Connector feeds this to Spark.
Cassandra Counter column in our schema, no expensive `reduceByKey` needed. Simply let C* do it: not expensive and fast.
![Page 72: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/72.jpg)

    /** For a given weather station, calculates annual cumulative precip - or year to date. */
    class PrecipitationActor(ssc: StreamingContext, settings: WeatherSettings) extends AggregationActor {

      def receive : Actor.Receive = {
        case GetPrecipitation(wsid, year)        => cumulative(wsid, year, sender)
        case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
      }

      /** Computes annual aggregation. Precipitation values are 1 hour deltas from the previous. */
      def cumulative(wsid: String, year: Int, requester: ActorRef): Unit =
        ssc.cassandraTable[Double](keyspace, dailytable)
          .select("precipitation")
          .where("wsid = ? AND year = ?", wsid, year)
          .collectAsync()
          .map(AnnualPrecipitation(_, wsid, year)) pipeTo requester

      /** Returns the `k` highest precipitation values for the station in the `year`. */
      def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
        val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,
          ssc.sparkContext.parallelize(aggregate).top(k).toSeq)

        ssc.cassandraTable[Double](keyspace, dailytable)
          .select("precipitation")
          .where("wsid = ? AND year = ?", wsid, year)
          .collectAsync().map(toTopK) pipeTo requester
      }
    }
![Page 73: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/73.jpg)
A New Approach
• One Runtime: streaming, scheduled
• Simplified architecture
• Allows us to
• Write different types of applications
• Write more type safe code
• Write more reusable code
![Page 74: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/74.jpg)
Need daily analytics aggregate reports? Do it in the stream, save results in Cassandra for easy reporting as needed - with data locality not offered by S3.
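Aggregating "in the stream" means folding each new reading into a running total keyed by station and day, so the daily report is a lookup rather than a batch job. The plain-Scala sketch below illustrates that with an immutable `Map` standing in for the Cassandra table; `DailyPrecip`, `Reading`, and `accumulate` are hypothetical names, with field names borrowed from the weather example earlier in the deck.

```scala
// Hypothetical sketch of incremental daily aggregation; a Map stands in for the Cassandra table.
object DailyPrecip {
  final case class Reading(wsid: String, year: Int, month: Int, day: Int, oneHourPrecip: Double)

  // Station id plus calendar day, playing the role of the table's key.
  type DayKey = (String, Int, Int, Int)

  // Add each hourly reading to the running total for its station and day,
  // similar in spirit to incrementing a counter per key.
  def accumulate(totals: Map[DayKey, Double], readings: Seq[Reading]): Map[DayKey, Double] =
    readings.foldLeft(totals) { (acc, r) =>
      val key = (r.wsid, r.year, r.month, r.day)
      acc.updated(key, acc.getOrElse(key, 0.0) + r.oneHourPrecip)
    }
}
```

Calling `accumulate` once per micro-batch and carrying the result forward keeps the totals current, so reporting code only has to read them.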
![Page 75: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/75.jpg)
FiloDB
Distributed, columnar database designed to run very fast analytical queries
• Ingest streaming data from many streaming sources
• Row-level, column-level operations and built in versioning offer greater flexibility than file-based technologies
• Currently based on Apache Cassandra & Spark
• github.com/tuplejump/FiloDB
![Page 76: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/76.jpg)
FiloDB
• Breakthrough performance levels for analytical queries
• Performance comparable to Parquet
• One to two orders of magnitude faster than Spark on Cassandra 2.x
• Versioned - critical for reprocessing logic/code changes
• Can simplify your infrastructure dramatically
• Queries run in parallel in Spark for scale-out ad-hoc analysis
• Space-saving techniques
![Page 77: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/77.jpg)
WRAPPING UP
![Page 78: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/78.jpg)
Architectyr?
"This is a giant mess"
- Going Real-time - Data Collection and Stream Processing with Apache Kafka, Jay Kreps
![Page 79: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/79.jpg)
Simplified
![Page 80: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/80.jpg)
![Page 82: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/82.jpg)
@helenaedelson
github.com/helena
slideshare.net/helenaedelson
THANK YOU!
![Page 83: Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson](https://reader030.vdocuments.net/reader030/viewer/2022020108/58ac4e361a28ab99028b642f/html5/thumbnails/83.jpg)
I'm speaking at QCon SF on the broader topic of Streaming at Scale
http://qconsf.com/sf2015/track/streaming-data-scale