Harnessing the Power of Spark + Cassandra within your Spring App
Steve Pember
CTO, ThirdChannel @svpember
@spring_io #springio17
RELATIONAL DATABASES ARE FANTASTIC
SQL MAKES YOU STRONG
Agenda
• Spark
• Cassandra
• Spark + Cassandra
• Working with Spark + Cassandra
• Demo
Apache Spark
• Distributed Execution Engine
Apache Spark
• Distributed Execution Engine
• What about Hadoop?
Hadoop
• Map / Reduce
• Storage via HDFS
• Each calculation step written to disk
Spark
• More than Map/Reduce
• No dependent storage mechanism
• Clustered calculations, each step in memory
Apache Spark
• Distributed Execution Engine
• What about Hadoop?
• Creation was a Happy Accident
Apache Spark
• Distributed Execution Engine
• What about Hadoop?
• Creation was a Happy Accident
• Architecture
Your Spring App
Apache Spark
• Distributed Execution Engine
• What about Hadoop?
• Creation was a Happy Accident
• Architecture
• Programmatic structure
THE SPARKCONTEXT SUBMITS JOBS TO THE CLUSTER
OPERATIONS ARE PERFORMED AGAINST RDDS
Resilient Distributed Dataset
• Immutable
• Partitioned
• Parallel operations
• Created by performing operations on other RDDs
• Reusable & Composable
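The properties above can be sketched in plain Java. The ToyRdd class below is a hypothetical teaching model, not Spark's API: it only mimics the idea that a dataset is split into partitions, is never mutated, and that every operation yields a new dataset. The names ToyRdd, parallelize, and collect are borrowed for familiarity; in real Spark code the equivalent calls would be JavaSparkContext.parallelize and JavaRDD.map / filter / reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Toy stand-in for an RDD (NOT Spark's API): immutable, partitioned, composable.
final class ToyRdd<T> {
    private final List<List<T>> partitions; // never mutated after construction

    private ToyRdd(List<List<T>> partitions) {
        this.partitions = partitions;
    }

    // Split the input into numPartitions roughly equal slices.
    static <T> ToyRdd<T> parallelize(List<T> data, int numPartitions) {
        List<List<T>> parts = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            int from = p * data.size() / numPartitions;
            int to = (p + 1) * data.size() / numPartitions;
            parts.add(new ArrayList<>(data.subList(from, to)));
        }
        return new ToyRdd<>(Collections.unmodifiableList(parts));
    }

    // Each transformation processes partitions in parallel and returns a NEW ToyRdd.
    <R> ToyRdd<R> map(Function<T, R> f) {
        List<List<R>> out = partitions.parallelStream()
                .map(part -> part.stream().map(f).collect(Collectors.toList()))
                .collect(Collectors.toList());
        return new ToyRdd<>(out);
    }

    ToyRdd<T> filter(Predicate<T> keep) {
        List<List<T>> out = partitions.parallelStream()
                .map(part -> part.stream().filter(keep).collect(Collectors.toList()))
                .collect(Collectors.toList());
        return new ToyRdd<>(out);
    }

    // Reduce each partition on its own, then combine the partial results.
    // The operator must be associative for the result to be well defined.
    T reduce(BinaryOperator<T> op) {
        return partitions.parallelStream()
                .filter(part -> !part.isEmpty())
                .map(part -> part.stream().reduce(op).get())
                .reduce(op)
                .orElseThrow(() -> new IllegalStateException("empty dataset"));
    }

    // Gather all partitions back into one list, in the original order.
    List<T> collect() {
        return partitions.stream().flatMap(List::stream).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ToyRdd<Integer> numbers = ToyRdd.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);
        ToyRdd<Integer> evens = numbers.filter(n -> n % 2 == 0); // derived from numbers
        ToyRdd<Integer> squares = evens.map(n -> n * n);         // derived from evens
        System.out.println(squares.collect());
        System.out.println(squares.reduce(Integer::sum));
        System.out.println(numbers.collect()); // the parent dataset is unchanged
    }
}
```

Because no step modifies its parent, numbers, evens, and squares can each be reused as the starting point for further operations, which is the "Reusable & Composable" point on the slide.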
Apache Spark
• Distributed Execution Engine
• What about Hadoop?
• Creation was a Happy Accident
• Architecture
• Programmatic structure
• APIs
MORE THAN MAP/REDUCE
RDD operations
• map
• reduce
• aggregate
• filter
• flatMap
• join
• … plus many more
Apache Spark
• Distributed Execution Engine
• What about Hadoop?
• Creation was a Happy Accident
• Architecture
• Programmatic structure
• APIs
• Additional Modules
SPARK SQL…!
JDBC?
SPARK STREAMING!
Agenda
• Spark
• Cassandra
Apache Cassandra (C*)
• NoSql Datastore
Apache Cassandra (C*)
• NoSql Datastore
• Distributed
DETERMINISTIC DISTRIBUTION
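A minimal sketch of what deterministic distribution means, in plain Java with no Cassandra driver involved. The class ToyTokenRing, the node names, and the token values are all hypothetical, and CRC32 is used only as a convenient stand-in hash; Cassandra's default partitioner uses Murmur3. The idea it illustrates: a partition key is hashed to a token, and the token alone decides which node owns the row.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Toy token ring (NOT Cassandra code): a key hashes to a token, the token picks the node.
final class ToyTokenRing {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // token -> node name

    void addNode(String node, long token) {
        ring.put(token, node);
    }

    // Stand-in hash function; an assumption made for this sketch only.
    static long tokenFor(String partitionKey) {
        CRC32 crc = new CRC32();
        crc.update(partitionKey.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Owner = first node whose token is >= the given token, wrapping around the ring.
    // Assumes at least one node has been added.
    String ownerOfToken(long token) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(token);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    String ownerOf(String partitionKey) {
        return ownerOfToken(tokenFor(partitionKey));
    }

    public static void main(String[] args) {
        ToyTokenRing ring = new ToyTokenRing();
        ring.addNode("node-a", 1000L);
        ring.addNode("node-b", 2000L);
        ring.addNode("node-c", 3000L);
        System.out.println(ring.ownerOfToken(1500L)); // falls in node-b's range
        System.out.println(ring.ownerOfToken(3500L)); // past the last token, wraps to node-a
        System.out.println(ring.ownerOf("user-42"));  // same key, same node, every time
    }
}
```

Any client that knows the ring can compute the owner without asking a coordinator, which is what makes the placement deterministic.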
Apache Cassandra (C*)
• NoSql Datastore
• Distributed
• High Replication
Apache Cassandra (C*)
• NoSql Datastore
• Distributed
• High Replication
• High Durability
Apache Cassandra (C*)
• NoSql Datastore
• Distributed
• High Replication
• High Durability
• Linear Scalability
EACH NEW NODE RESULTS IN INCREASED STORAGE WITH NO LOSS IN PERFORMANCE
Apache Cassandra (C*)
• NoSql Datastore
• Distributed
• High Replication
• High Durability
• Linear Scalability
• Data Model (CQL)
WIDE-COLUMN (COLUMN-FAMILY) DATABASE
BUT IT’S SQL-LIKE!
QUERYING
C* Querying
• select * from …
• all queries must include partition key(s)
• order by limited to clustering keys
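To make the querying rules concrete, here is a small sketch with the CQL held as plain Java strings, so it needs no driver on the classpath. The keyspace demo, the table events_by_user, and its columns are hypothetical and chosen only for illustration.

```java
// Hypothetical schema and queries; the names are assumptions, not from the talk.
final class EventsByUserCql {
    // user_id is the partition key; event_time is the clustering key.
    static final String CREATE_TABLE =
        "CREATE TABLE IF NOT EXISTS demo.events_by_user ("
      + " user_id uuid, event_time timestamp, event_type text,"
      + " PRIMARY KEY ((user_id), event_time)"
      + ") WITH CLUSTERING ORDER BY (event_time DESC)";

    // Accepted: restricted by the partition key, ordered by the clustering key.
    static final String RECENT_EVENTS =
        "SELECT event_time, event_type FROM demo.events_by_user"
      + " WHERE user_id = ? ORDER BY event_time DESC LIMIT 10";

    // Rejected by default: no partition key restriction, and event_type is not indexed.
    // Cassandra would require ALLOW FILTERING (a full scan) to run it.
    static final String REJECTED_BY_DEFAULT =
        "SELECT * FROM demo.events_by_user WHERE event_type = 'login'";

    public static void main(String[] args) {
        System.out.println(CREATE_TABLE);
        System.out.println(RECENT_EVENTS);
        System.out.println(REJECTED_BY_DEFAULT);
    }
}
```

The third query is the kind of question that is awkward for C* alone: it has to touch every partition.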
Apache Cassandra (C*)
• NoSql Datastore
• Distributed
• High Replication
• High Durability
• Linear Scalability
• Data Model (CQL)
• Designing your Data Model
Agenda
• Spark
• Cassandra
• Spark + Cassandra
Spark + Cassandra
– Reduce each other’s weaknesses
– Filter on the server side (with C*)
– Join tables, filter results (with Spark)
COMPANIES HAVE BEEN FORMED
CLUSTER DESIGN
DATA LOCALITY!
PIPELINE ARCHITECTURE
Agenda
• Spark
• Cassandra
• Spark + Cassandra
• Working with Spark + Cassandra
OPTIONS FOR SPRING?
BUT WE DIDN’T GO THAT ROUTE
Our Excuses
• Wanted to take full advantage of Spark + C* connector
• Our setup / pipeline is relatively minimal
• Programming model is easy
CODING SPARK + C*
• SparkConf
• JavaSparkContext
• JavaFunctions
• Mappers
Spark Conf
• spark.master -> URL of the master node
• spark.app.name -> want to see your client show up in the Spark UI?
• spark.executor.memory -> limits memory per executor on workers
• spark.executor.cores -> limits cores on each worker (need to share with C*!)
• spark.submit.deployMode -> ‘client’ or ‘cluster’
• spark.jars.packages -> Maven / Gradle-style coordinates
• spark.jars.ivy -> specify custom repos for packages
• more at: http://spark.apache.org/docs/latest/configuration.html#available-properties
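A minimal sketch of how a handful of these settings might be gathered in a Spring app. To keep it runnable without Spark on the classpath, the values live in a plain Map; in a real application each entry would be handed to SparkConf.set(key, value). The class name, host name, app name, and sizing values are placeholders, not recommendations.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Placeholder values throughout; tune memory and cores for your own cluster.
final class SparkSettingsSketch {
    static Map<String, String> settings() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.master", "spark://spark-master.example.com:7077"); // hypothetical host
        conf.put("spark.app.name", "spring-spark-demo");  // the name shown in the Spark UI
        conf.put("spark.executor.memory", "2g");          // memory per executor
        conf.put("spark.executor.cores", "2");            // leave cores free for C*
        conf.put("spark.submit.deployMode", "client");    // driver runs inside the app
        return conf;
    }

    public static void main(String[] args) {
        // With Spark available, this loop would call sparkConf.set(key, value) instead.
        settings().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Keeping the settings in one place like this also makes it easy to bind them to externalized Spring configuration later.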
Master URL Overloading
• “local” -> run Spark locally, in-process, with one thread
• “local[<K>]” -> Spark, local, with K threads
• “local[*]” -> Spark, local, with ALL YOUR THREADS! (one per logical core)
• “spark://<host string>:<port>” -> URL for a Spark cluster master node, using Spark’s own (standalone) cluster manager
• also options for Mesos and YARN
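The overloading can be read as a small parsing rule. The helper below is a hypothetical illustration in plain Java, not Spark's own parser: it reports how many threads a local master string asks for, and returns -1 for anything that points at a cluster manager.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that mirrors the rules on the slide; not part of Spark.
final class MasterUrlSketch {
    private static final Pattern LOCAL_N = Pattern.compile("local\\[(\\d+|\\*)\\]");

    // Returns the local thread count, or -1 when the URL names a cluster manager.
    static int localThreads(String master) {
        if (master.equals("local")) {
            return 1; // single thread
        }
        Matcher m = LOCAL_N.matcher(master);
        if (m.matches()) {
            String k = m.group(1);
            return k.equals("*")
                    ? Runtime.getRuntime().availableProcessors() // one per logical core
                    : Integer.parseInt(k);
        }
        return -1; // e.g. spark://host:port, or a Mesos / YARN master
    }

    public static void main(String[] args) {
        System.out.println(localThreads("local"));
        System.out.println(localThreads("local[4]"));
        System.out.println(localThreads("local[*]"));
        System.out.println(localThreads("spark://spark-master.example.com:7077"));
    }
}
```

A local master is handy for tests inside the Spring app, since no separate cluster has to be running.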
HOWEVER, A WARNING
MOST DIFFICULT PART: WHERE DOES MY CODE LIVE?
CLASS_PATH: org.apache.spark, com.fasterxml.jackson, com.yourco.yourapp.pojos.*
CLASS_PATH: org.apache.spark, com.fasterxml.jackson
CLASS_PATH: org.apache.spark, com.fasterxml.jackson
Agenda
• Spark
• Cassandra
• Spark + Cassandra
• Working with Spark + Cassandra
• Demo
Thank You!
@svpember
Links
• Cassandra on AWS official Whitepaper: https://d0.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf
• Demo Code project link: https://github.com/spember/spark-cass-spring-demo
Images
• Database Sharding: https://dzone.com/articles/ebay-secret-database-scaling
• Indiana Jones Warehouse: http://logisticalfictions.tumblr.com/page/9
• Strong (Spongebob): www.reactiongifs.com/strongbob/?utm_source=rss&utm_medium=rss&utm_campaign=strongbob
• Cheetah: www.livescience.com/21944-usain-bolt-vs-cheetah-animal-olympics.html
• Big Data Cartoon: http://www.kdnuggets.com/2016/08/cartoon-make-data-great-again.html
• Spark Streaming: http://velvia.github.io/presentations/2015-filodb-spark-streaming/#/
• Picard + Riker: http://www.douxreviews.com/2015/09/star-trek-next-generation-matter-of.html
• Software Engineers: http://pyxurz.blogspot.com/2011/10/office-space-page-2-of-6.html
• Throwing Money: https://vimeo.com/132892478