
Cassandra And Spark Dataframes

Russell Spitzer, Software Engineer @ DataStax


Tungsten Gives DataFrames Off-Heap Power!

Can compare memory off-heap and bitwise! Code generation!

The Core is the Cassandra Source

https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra

    /**
     * Implements [[BaseRelation]], [[InsertableRelation]] and [[PrunedFilteredScan]].
     * It inserts data to and scans a Cassandra table. If filterPushdown is true,
     * it pushes down some filters to CQL.
     */

The DataFrame source: org.apache.spark.sql.cassandra

[Diagram: DataFrame → CassandraSourceRelation → CassandraTableScanRDD, driven by its Configuration]

Configuration Can Be Done on a Per Source Level

Properties are namespaced as clusterName:keyspaceName/propertyName.

Example: changing cluster- and keyspace-level properties

    val conf = new SparkConf()
      .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
      .set("default:test/spark.cassandra.input.split.size_in_mb", "128")

    val lastdf = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map(
        "table"    -> "words",
        "keyspace" -> "test",
        "cluster"  -> "ClusterOne"
      ))
      .load()

Which value applies depends on the source's cluster and keyspace options:

- "cluster" -> "ClusterOne": spark.cassandra.input.split.size_in_mb = 32
- "cluster" -> "default", "keyspace" -> "test": spark.cassandra.input.split.size_in_mb = 128
- "cluster" -> "default", "keyspace" -> "other": the Connector Default
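The same source handles writes as well; a minimal sketch (the words_copy table is hypothetical, not from the slides) of saving a DataFrame back out through it:

    import org.apache.spark.sql.SaveMode

    // Append the rows of lastdf into a (hypothetical) words_copy table
    // in the test keyspace, using the same source options as for reads.
    lastdf.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "words_copy", "keyspace" -> "test"))
      .mode(SaveMode.Append)
      .save()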

Predicate Pushdown Is Automatic!

    SELECT * FROM cassandraTable WHERE clusteringKey > 100

[Diagram: Catalyst sees the plan DataFromC* → Filter clusteringKey > 100 → Show, and pushes the filter into the source, adding the WHERE clause "clusteringKey > 100" to the CQL query]

https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/PredicatePushDown.scala
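A quick way to confirm a pushdown is to inspect the physical plan; a minimal sketch (the test.events table and clusteringKey column are hypothetical):

    val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "events", "keyspace" -> "test"))
      .load()

    // If the predicate qualifies, the plan shows it pushed into the
    // Cassandra scan rather than applied as a Spark-side Filter.
    df.filter(df("clusteringKey") > 100).explain()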

What can be pushed down?

1. Only push down non-partition-key column predicates with =, >, <, >=, or <=.
2. Only push down primary key column predicates with = or IN.
3. If there are regular columns among the pushdown predicates, there must be at least one EQ predicate on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
5. For clustering column predicates, only the last predicate can be a non-EQ predicate (including IN), and the preceding column predicates must be EQ predicates.
6. If there is only one clustering column predicate, it can be any non-IN predicate. Nothing is pushed down if there is any OR condition or NOT IN condition.
7. Multiple predicates on the same column are not pushed down if any of them is an = or IN predicate.

In short: if you could write it in CQL, it will get pushed down.
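As an illustration, a sketch against a hypothetical table with partition key pk and clustering column ck (df loaded through the Cassandra source as above):

    // Pushed down: CQL itself can express "WHERE pk = 1 AND ck > 100".
    df.filter("pk = 1 AND ck > 100")

    // Not pushed down: CQL has no OR, so Spark evaluates this filter
    // itself after scanning.
    df.filter("pk = 1 OR pk = 2")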

What are we Pushing Down To?

CassandraTableScanRDD

All of the underlying code is the same as with sc.cassandraTable, so everything about Reading and Writing applies.

https://academy.datastax.com/ — Watch me talk about this in the privacy of your own home!
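For reference, the RDD-level entry point is a one-liner (a minimal sketch, reusing the test.words table from earlier):

    import com.datastax.spark.connector._

    // sc.cassandraTable returns a CassandraTableScanRDD[CassandraRow] —
    // the same RDD the DataFrame source pushes its scans down to.
    val rdd = sc.cassandraTable("test", "words")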

How the Spark Cassandra Connector Reads Data

Spark RDDs Represent a Large Amount of Data Partitioned into Chunks

[Diagram: an RDD of partitions 1-9 spread across Node 1 through Node 4]

Cassandra Data is Distributed By Token Range

[Diagram: a token ring (tokens 0-999) divided among Node 1 through Node 4. Without vnodes, each node owns one contiguous slice of the ring. With vnodes, each node owns many small ranges; for example, Node 1 owns 0-50, 120-220, 300-500, and 780-830]

The Connector Uses Information on the Node to Make Spark Partitions

    spark.cassandra.input.split.size_in_mb    1

Reported density is 100 tokens per MB, so each Spark partition targets roughly 100 tokens' worth of data.

[Diagram: Node 1's token ranges are grouped and split into Spark partitions of about one split each: partition 1 = 120-220; the 300-500 range is split into partition 2 = 300-400 and partition 3 = 400-500; partition 4 combines the small ranges 780-830 and 0-50]
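The split arithmetic is simple enough to sketch (values taken from the slides):

    // Reported density: 100 tokens per MB; target split size: 1 MB.
    val tokensPerMb    = 100
    val splitSizeMb    = 1
    val tokensPerSplit = tokensPerMb * splitSizeMb  // ~100 tokens per Spark partition

    // Node 1 owns 0-50, 120-220, 300-500, and 780-830: 400 tokens in total,
    // so the connector carves them into 400 / 100 = 4 Spark partitions.
    val nodeTokens    = 50 + 100 + 200 + 50
    val numPartitions = nodeTokens / tokensPerSplit // 4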

Data is Retrieved Using the DataStax Java Driver

    spark.cassandra.input.page.row.size    50

Spark partition 4 (token ranges 0-50 and 780-830) is read from Node 1 with one token-range query per range:

    SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
    SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

[Diagram: the driver pages the results back 50 CQL rows at a time until both ranges are exhausted]

How the Spark Cassandra Connector Writes Data

Spark RDDs Represent a Large Amount of Data Partitioned into Chunks

[Diagram: as before, an RDD of partitions 1-9 spread across Node 1 through Node 4]

The Spark Cassandra Connector's saveToCassandra method can be called on almost all RDDs:

    rdd.saveToCassandra("Keyspace", "Table")
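A minimal end-to-end sketch (the test.kv table and its pk, ck, v columns are hypothetical), assuming the connector implicits are imported:

    import com.datastax.spark.connector._

    // Each tuple maps positionally onto the named columns.
    val rows = sc.parallelize(Seq((1, 1, 1), (1, 2, 1), (2, 1, 1)))
    rows.saveToCassandra("test", "kv", SomeColumns("pk", "ck", "v"))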

A Java Driver connection is made to the local node and a prepared statement is built for the target table.

Batches are built from data in Spark partitions.

[Diagram: rows such as (1,1,1), (1,2,1), (2,1,1), (3,8,1), (3,2,1), ... stream from the Spark partition into the Java Driver on Node 1]

By default these batches only contain CQL rows which share the same partition key.

The example settings used throughout this walkthrough:

    spark.cassandra.output.batch.grouping.key     partition
    spark.cassandra.output.batch.size.rows        4
    spark.cassandra.output.batch.buffer.size      3
    spark.cassandra.output.concurrent.writes      2
    spark.cassandra.output.throughput_mb_per_sec  5

[Diagram: rows (1,1,1) and (1,2,1) are grouped into a batch labeled PK=1]

When an element is not part of an existing batch, a new batch is started.

[Diagram: row (2,1,1) starts a new PK=2 batch alongside the PK=1 batch]

If a batch reaches batch.size.rows or batch.size.bytes, it is executed by the driver.

[Diagram: the PK=3 batch fills with four rows ((3,8,1), (3,2,1), (3,4,1), (3,5,1)) and is sent]

If more than batch.buffer.size batches are currently being made, the largest batch is executed by the Java Driver.

[Diagram: with PK=1, PK=2, and PK=3 batches open, row (5,4,1) would open a fourth; the largest open batch (PK=1) is flushed to make room, and a PK=5 batch is started]

If more batches are currently being executed by the Java Driver than concurrent.writes, we wait until one of the requests has been completed.

[Diagram: with two batches already in flight, the next batch waits; once a write is acknowledged, the freed slot lets another batch (e.g. PK=8) be sent]

The last parameter, throughput_mb_per_sec, blocks further batches if we have written more than that much in the past second.

[Diagram: writes are acknowledged, but the next batch is blocked until the rolling one-second write total drops back below 5 MB]
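Putting the write-path knobs together, a sketch using the values from the slides:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Group rows into batches by Cassandra partition key.
      .set("spark.cassandra.output.batch.grouping.key", "partition")
      // Execute a batch once it holds 4 rows.
      .set("spark.cassandra.output.batch.size.rows", "4")
      // Keep at most 3 batches under construction at once.
      .set("spark.cassandra.output.batch.buffer.size", "3")
      // Allow at most 2 batches in flight at a time.
      .set("spark.cassandra.output.concurrent.writes", "2")
      // Throttle writes to 5 MB per second.
      .set("spark.cassandra.output.throughput_mb_per_sec", "5")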

Thanks for coming, and I hope you have a great time at C* Summit!

http://cassandrasummit-datastax.com/agenda/the-spark-cassandra-connector-past-present-and-future/

Also ask these guys really hard questions: Jacek, Piotr, Alex
