
Cassandra And Spark Dataframes

Russell Spitzer, Software Engineer @ DataStax


Tungsten Gives DataFrames Off-Heap Power!

Can compare memory off-heap and bitwise! Code generation!

The Core is the Cassandra Source

https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra

    /**
     * Implements [[BaseRelation]], [[InsertableRelation]] and [[PrunedFilteredScan]].
     * It inserts data to and scans a Cassandra table. If filterPushdown is true,
     * it pushes down some filters to CQL.
     */

The DataFrame source: org.apache.spark.sql.cassandra

[Diagram: DataFrame → CassandraSourceRelation → CassandraTableScanRDD, driven by its Configuration]

Configuration Can Be Done on a Per Source Level

Properties are namespaced as clusterName:keyspaceName/propertyName.

Example: changing cluster- and keyspace-level properties

    val conf = new SparkConf()
      .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
      .set("default:test/spark.cassandra.input.split.size_in_mb", "128")

    val lastdf = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map(
        "table"    -> "words",
        "keyspace" -> "test",
        "cluster"  -> "ClusterOne"
      ))
      .load()

Which value applies depends on the source's cluster and keyspace options:

- "cluster" -> "ClusterOne": spark.cassandra.input.split.size_in_mb = 32
- "cluster" -> "default", "keyspace" -> "test": spark.cassandra.input.split.size_in_mb = 128
- "cluster" -> "default", "keyspace" -> "other": the Connector Default
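The same source handles writes as well; a minimal sketch (the words_copy table is hypothetical, not from the slides) of saving a DataFrame back out through it:

    import org.apache.spark.sql.SaveMode

    // Append the rows of lastdf into a (hypothetical) words_copy table
    // in the test keyspace, using the same source options as for reads.
    lastdf.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "words_copy", "keyspace" -> "test"))
      .mode(SaveMode.Append)
      .save()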

Predicate Pushdown Is Automatic!

    SELECT * FROM cassandraTable WHERE clusteringKey > 100

[Diagram: Catalyst sees the plan DataFromC* → Filter clusteringKey > 100 → Show, and pushes the filter into the source, adding the WHERE clause "clusteringKey > 100" to the CQL query]

https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/PredicatePushDown.scala
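A quick way to confirm a pushdown is to inspect the physical plan; a minimal sketch (the test.events table and clusteringKey column are hypothetical):

    val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "events", "keyspace" -> "test"))
      .load()

    // If the predicate qualifies, the plan shows it pushed into the
    // Cassandra scan rather than applied as a Spark-side Filter.
    df.filter(df("clusteringKey") > 100).explain()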

What can be pushed down?

1. Only push down non-partition-key column predicates with =, >, <, >=, or <=.
2. Only push down primary key column predicates with = or IN.
3. If there are regular columns among the pushdown predicates, there must be at least one EQ predicate on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
5. For clustering column predicates, only the last predicate can be a non-EQ predicate (including IN), and the preceding column predicates must be EQ predicates.
6. If there is only one clustering column predicate, it can be any non-IN predicate. Nothing is pushed down if there is any OR condition or NOT IN condition.
7. Multiple predicates on the same column are not pushed down if any of them is an = or IN predicate.

In short: if you could write it in CQL, it will get pushed down.
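As an illustration, a sketch against a hypothetical table with partition key pk and clustering column ck (df loaded through the Cassandra source as above):

    // Pushed down: CQL itself can express "WHERE pk = 1 AND ck > 100".
    df.filter("pk = 1 AND ck > 100")

    // Not pushed down: CQL has no OR, so Spark evaluates this filter
    // itself after scanning.
    df.filter("pk = 1 OR pk = 2")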

What are we Pushing Down To?

CassandraTableScanRDD

All of the underlying code is the same as with sc.cassandraTable, so everything about Reading and Writing applies.

https://academy.datastax.com/ — Watch me talk about this in the privacy of your own home!
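For reference, the RDD-level entry point is a one-liner (a minimal sketch, reusing the test.words table from earlier):

    import com.datastax.spark.connector._

    // sc.cassandraTable returns a CassandraTableScanRDD[CassandraRow] —
    // the same RDD the DataFrame source pushes its scans down to.
    val rdd = sc.cassandraTable("test", "words")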

How the Spark Cassandra Connector Reads Data

Spark RDDs Represent a Large Amount of Data Partitioned into Chunks

[Diagram: an RDD of partitions 1-9 spread across Node 1 through Node 4]

Cassandra Data is Distributed By Token Range

[Diagram: a token ring (tokens 0-999) divided among Node 1 through Node 4. Without vnodes, each node owns one contiguous slice of the ring. With vnodes, each node owns many small ranges; for example, Node 1 owns 0-50, 120-220, 300-500, and 780-830]

The Connector Uses Information on the Node to Make Spark Partitions

    spark.cassandra.input.split.size_in_mb    1

Reported density is 100 tokens per MB, so each Spark partition targets roughly 100 tokens' worth of data.

[Diagram: Node 1's token ranges are grouped and split into Spark partitions of about one split each: partition 1 = 120-220; the 300-500 range is split into partition 2 = 300-400 and partition 3 = 400-500; partition 4 combines the small ranges 780-830 and 0-50]
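The split arithmetic is simple enough to sketch (values taken from the slides):

    // Reported density: 100 tokens per MB; target split size: 1 MB.
    val tokensPerMb    = 100
    val splitSizeMb    = 1
    val tokensPerSplit = tokensPerMb * splitSizeMb  // ~100 tokens per Spark partition

    // Node 1 owns 0-50, 120-220, 300-500, and 780-830: 400 tokens in total,
    // so the connector carves them into 400 / 100 = 4 Spark partitions.
    val nodeTokens    = 50 + 100 + 200 + 50
    val numPartitions = nodeTokens / tokensPerSplit // 4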

Data is Retrieved Using the DataStax Java Driver

    spark.cassandra.input.page.row.size    50

Spark partition 4 (token ranges 0-50 and 780-830) is read from Node 1 with one token-range query per range:

    SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
    SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: the driver pages the results back 50 CQL rows at a time until both ranges are exhausted]

How the Spark Cassandra Connector Writes Data

Spark RDDs Represent a Large Amount of Data Partitioned into Chunks

[Diagram: as before, an RDD of partitions 1-9 spread across Node 1 through Node 4]

The Spark Cassandra Connector's saveToCassandra method can be called on almost all RDDs:

    rdd.saveToCassandra("Keyspace", "Table")
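A minimal end-to-end sketch (the test.kv table and its pk, ck, v columns are hypothetical), assuming the connector implicits are imported:

    import com.datastax.spark.connector._

    // Each tuple maps positionally onto the named columns.
    val rows = sc.parallelize(Seq((1, 1, 1), (1, 2, 1), (2, 1, 1)))
    rows.saveToCassandra("test", "kv", SomeColumns("pk", "ck", "v"))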

A Java Driver connection is made to the local node and a prepared statement is built for the target table.

Batches are built from data in Spark partitions.

[Diagram: rows such as (1,1,1), (1,2,1), (2,1,1), (3,8,1), (3,2,1), ... stream from the Spark partition into the Java Driver on Node 1]

By default these batches only contain CQL rows which share the same partition key.

The example settings used throughout this walkthrough:

    spark.cassandra.output.batch.grouping.key     partition
    spark.cassandra.output.batch.size.rows        4
    spark.cassandra.output.batch.buffer.size      3
    spark.cassandra.output.concurrent.writes      2
    spark.cassandra.output.throughput_mb_per_sec  5

[Diagram: rows (1,1,1) and (1,2,1) are grouped into a batch labeled PK=1]

When an element is not part of an existing batch, a new batch is started.

[Diagram: row (2,1,1) starts a new PK=2 batch alongside the PK=1 batch]

If a batch reaches batch.size.rows or batch.size.bytes, it is executed by the driver.

[Diagram: the PK=3 batch fills with four rows ((3,8,1), (3,2,1), (3,4,1), (3,5,1)) and is sent]

If more than batch.buffer.size batches are currently being made, the largest batch is executed by the Java Driver.

[Diagram: with PK=1, PK=2, and PK=3 batches open, row (5,4,1) would open a fourth; the largest open batch (PK=1) is flushed to make room, and a PK=5 batch is started]

If more batches are currently being executed by the Java Driver than concurrent.writes, we wait until one of the requests has been completed.

[Diagram: with two batches already in flight, the next batch waits; once a write is acknowledged, the freed slot lets another batch (e.g. PK=8) be sent]

The last parameter, throughput_mb_per_sec, blocks further batches if we have written more than that much in the past second.

[Diagram: writes are acknowledged, but the next batch is blocked until the rolling one-second write total drops back below 5 MB]
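Putting the write-path knobs together, a sketch using the values from the slides:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Group rows into batches by Cassandra partition key.
      .set("spark.cassandra.output.batch.grouping.key", "partition")
      // Execute a batch once it holds 4 rows.
      .set("spark.cassandra.output.batch.size.rows", "4")
      // Keep at most 3 batches under construction at once.
      .set("spark.cassandra.output.batch.buffer.size", "3")
      // Allow at most 2 batches in flight at a time.
      .set("spark.cassandra.output.concurrent.writes", "2")
      // Throttle writes to 5 MB per second.
      .set("spark.cassandra.output.throughput_mb_per_sec", "5")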

Thanks for coming, and I hope you have a great time at C* Summit!

http://cassandrasummit-datastax.com/agenda/the-spark-cassandra-connector-past-present-and-future/

Also ask these guys really hard questions: Jacek, Piotr, Alex
