Spark-Cassandra Integration 2016, by DuyHai DOAN, Apache Cassandra Evangelist


TRANSCRIPT

Page 1: Spark cassandra integration 2016

Spark-Cassandra Integration 2016, DuyHai DOAN, Apache Cassandra Evangelist

Page 2: Spark cassandra integration 2016


Main use-cases

Load data from various sources

Analytics (join, aggregate, transform, …)

Sanitize, validate, normalize, transform data

Schema migration, Data conversion

Page 3: Spark cassandra integration 2016

Data import

• Read data from CSV and dump it into Cassandra?

☞ Spark Job to distribute the import! (see the sketch below)

Load data from various sources
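As an illustration, a minimal Scala sketch of such a distributed import with the spark-cassandra-connector; the file path, keyspace, table and column names are assumptions for the example:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Hypothetical target: a music.performers table with (name, country, style) columns
val conf = new SparkConf()
  .setAppName("csv-import")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

sc.textFile("/data/performers.csv")           // each worker reads a slice of the file
  .map(_.split(","))                          // naive split; a real job would use a CSV parser
  .map(cols => (cols(0), cols(1), cols(2)))   // (name, country, style)
  .saveToCassandra("music", "performers",     // distributed write, one task per Spark partition
    SomeColumns("name", "country", "style"))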

Page 4: Spark cassandra integration 2016

Demo


Page 5: Spark cassandra integration 2016

Data cleaning

Sanitize, validate, normalize, transform data

• Bugs in your application?
• Dirty input data?

☞ Spark Job to clean it up! (sketch below)
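A minimal sketch of such a cleanup job over the same hypothetical music.performers table; the normalization and validation rules are only examples:

import com.datastax.spark.connector._

// Read the dirty table, normalize the fields, drop rows that fail validation
sc.cassandraTable[(String, String, String)]("music", "performers")
  .select("name", "country", "style")
  .map { case (name, country, style) =>
    (name.trim, country.trim.toUpperCase, style.trim.toLowerCase)   // normalize
  }
  .filter { case (name, _, _) => name.nonEmpty }                    // validate
  .saveToCassandra("music", "performers_clean",
    SomeColumns("name", "country", "style"))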

Page 6: Spark cassandra integration 2016

Demo


Page 7: Spark cassandra integration 2016

Schema migration

• Business requirements change with time?
• Current data model no longer relevant?

☞ Spark Job to migrate data! (sketch after this slide)

Schema migration, Data conversion
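For example, a hedged sketch of such a migration job: read the old table, reshape each row for the new model, write to the new table. Both table layouts are assumptions:

import com.datastax.spark.connector._

// Hypothetical migration: performers_v1 is keyed by name,
// performers_v2 is keyed by (country, name) to match new queries
sc.cassandraTable("music", "performers_v1")
  .map(row => (row.getString("country"),
               row.getString("name"),
               row.getString("style")))
  .saveToCassandra("music", "performers_v2",
    SomeColumns("country", "name", "style"))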

Page 8: Spark cassandra integration 2016

Demo


Page 9: Spark cassandra integration 2016

Analytics

Given existing tables of performers and albums, I want:

① top 10 most common music styles (pop, rock, RnB, …)

② performer productivity (album count) by origin country and by decade

☞ Spark Job to compute analytics! (sketch of ① below)

Analytics (join, aggregate, transform, …)
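A minimal sketch of query ① with plain RDD operations; the table and column names are assumptions:

import com.datastax.spark.connector._

// Top 10 most common music styles, counted over the performers table
sc.cassandraTable("music", "performers")
  .map(row => (row.getString("style"), 1))
  .reduceByKey(_ + _)                   // count performers per style
  .sortBy(_._2, ascending = false)      // most common first
  .take(10)
  .foreach(println)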

Page 10: Spark cassandra integration 2016

Connector Architecture

• Cluster Deployment
• Data Locality
• Failure Handling
• Cross DC/cluster operations

Page 11: Spark cassandra integration 2016

Cluster Deployment

• Stand-alone cluster

[Diagram: five machines, each running a Cassandra node (C*) co-located with a Spark Worker (SparkW); one machine also hosts the Spark Master (SparkM)]

Page 12: Spark cassandra integration 2016

Data Locality – remember token ranges?

For a token space ]−x, x] split across 8 nodes A to H:

A = ]−x, −3x/4]
B = ]−3x/4, −2x/4]
C = ]−2x/4, −x/4]
D = ]−x/4, 0]
E = ]0, x/4]
F = ]x/4, 2x/4]
G = ]2x/4, 3x/4]
H = ]3x/4, x]

[Diagram: 8-node Cassandra ring, one token range per node]

Page 13: Spark cassandra integration 2016

Data Locality – how to

Each Spark RDD partition maps to a set of Cassandra token ranges.

[Diagram: Spark RDD partitions aligned with the Cassandra token ranges of the co-located nodes; each machine runs C* plus a Spark Worker, one also hosts the Spark Master]

Page 14: Spark cassandra integration 2016

Data Locality – how to

[Diagram: each Spark Worker processes the partitions whose token ranges belong to its local Cassandra node]

Page 15: Spark cassandra integration 2016

Perfect data locality scenario:
• read locally from Cassandra
• use operations that do not require a shuffle in Spark (map, filter, …)
• repartitionByCassandraReplica() to a table having the same partition key as the original table
• save back into this Cassandra table
(a sketch of this pipeline follows below)

Sanitize, validate, normalize, transform data USE CASE
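A sketch of the full pipeline under these constraints; the case class and table names are assumptions, and performers_raw / performers are supposed to share the partition key name:

import com.datastax.spark.connector._

case class Performer(name: String, country: String, style: String)

// 1. local reads: one Spark partition per set of local token ranges
val cleaned = sc.cassandraTable[Performer]("music", "performers_raw")
  // 2. map/filter preserve partitioning, no shuffle involved
  .map(p => p.copy(name = p.name.trim, country = p.country.trim))
  .filter(_.name.nonEmpty)

cleaned
  // 3. route each row to a Spark partition living on one of its replicas
  .repartitionByCassandraReplica("music", "performers")
  // 4. local writes back into the same-partition-key table
  .saveToCassandra("music", "performers")

The point of step 3 is that the only data movement left is the one Cassandra would have to do anyway at write time.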

Page 16: Spark cassandra integration 2016

Failure Handling

[Diagram: five machines, Cassandra co-located with Spark workers and the Spark Master]

What if 1 node is down?

What if 1 node is overloaded?

Page 17: Spark cassandra integration 2016

Failure Handling

[Diagram: the same co-located cluster]

What if 1 node is down?

What if 1 node is overloaded?

☞ the Spark master will re-assign the job to another worker

Page 18: Spark cassandra integration 2016

Failure Handling

Oh no, my data locality!!!

Page 19: Spark cassandra integration 2016

Failure Handling

Page 20: Spark cassandra integration 2016

Data Locality Impl

abstract class RDD[T](…) {

  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]

  protected def getPartitions: Array[Partition]

  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}

Page 21: Spark cassandra integration 2016

CassandraRDD

def getPreferredLocations(split: Partition): returns the IP addresses of the Cassandra replicas corresponding to this Spark partition (simplified sketch below)
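Not the connector's actual code, but a simplified sketch of the idea: each partition remembers the replica endpoints of its token ranges, and getPreferredLocations just surfaces them to the Spark scheduler:

import org.apache.spark.Partition

// Hypothetical partition type carrying the replica IPs of its token ranges
case class CassandraPartitionSketch(index: Int, replicaIps: Seq[String]) extends Partition

object LocalitySketch {
  // The scheduler will try to place the task on one of these hosts first
  def preferredLocations(split: Partition): Seq[String] =
    split.asInstanceOf[CassandraPartitionSketch].replicaIps
}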

Page 22: Spark cassandra integration 2016

Failure Handling

If RF > 1, the Spark master chooses the next preferred location, which is a replica 😎

Tune parameters:
• spark.locality.wait
• spark.locality.wait.process
• spark.locality.wait.node

Page 23: Spark cassandra integration 2016

Failure Handling

If RF > 1, the Spark master chooses the next preferred location, which is a replica 😎

Tune parameters (example below):
• spark.locality.wait
• spark.locality.wait.process
• spark.locality.wait.node

Only works for fixed token ranges (vnodes)
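These are standard Spark scheduler settings; a sketch of raising the node-local wait so tasks prefer staying on a replica (the values are illustrative, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "3s")           // default fallback delay between locality levels
  .set("spark.locality.wait.process", "3s")   // how long to wait for a process-local slot
  .set("spark.locality.wait.node", "10s")     // wait longer for a node-local (replica) slot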

Page 24: Spark cassandra integration 2016

Cross cluster/DC operations
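The connector supports binding RDDs to different clusters through implicit CassandraConnector instances; a sketch of a cross-cluster copy, where the hosts, keyspace and table are assumptions:

import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

// Two connectors, one per cluster (or per DC entry point)
val connectorA = CassandraConnector(sc.getConf.clone
  .set("spark.cassandra.connection.host", "10.0.0.1"))
val connectorB = CassandraConnector(sc.getConf.clone
  .set("spark.cassandra.connection.host", "10.0.1.1"))

// Read from cluster A …
val albums = {
  implicit val c = connectorA
  sc.cassandraTable("music", "albums")
}

// … and write to cluster B
{
  implicit val c = connectorB
  albums.saveToCassandra("music", "albums")
}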

Page 25: Spark cassandra integration 2016

Tales from the field, SASI index benchmark

• Deployment automation
• Parallel ingestion
• Migrating data
• Spark + Cassandra 3.4 SASI index for topK query

Page 26: Spark cassandra integration 2016

Deployment Automation

Use Ansible to bootstrap a cluster:
• role tools (install vim, htop, dstat, fio, jmxterm…)
• role Cassandra. Do not put all nodes as seeds…
• role Spark (vanilla Spark). Slave on all nodes, master on a random node

DO NOT START ALL CASSANDRA NODES AT THE SAME TIME!!!!
• bootstrap the seed nodes first
• give ≥ 30 secs between two node bootstraps for token range agreement
• watch -n 5 nodetool status

Page 27: Spark cassandra integration 2016

Parallel ingestion for SASI index benchmark

Hardware specs:
• 13 nodes
• 6-core CPU (HT)
• 4 SSDs in RAID 0 😎
• 64 GB of RAM

Cassandra conf:
• G1GC, 32 GB JVM heap
• compaction throughput in MB = 256
• concurrent compactors = 2

Page 28: Spark cassandra integration 2016

Parallel ingestion for SASI index benchmark

Page 29: Spark cassandra integration 2016

Parallel ingestion for SASI index benchmark

3.2 billion rows in 17 h (compaction disabled)

RF = 2

☞ ≈ 8000 ips

I/O idle, high CPU

Page 30: Spark cassandra integration 2016

Migrating Data

Page 31: Spark cassandra integration 2016

Migrating Data

Page 32: Spark cassandra integration 2016

TopK query

Pass 1, for each music provider:
• sum album sales counts by title
• take the top N, assign a weight from descending order (1st = 1000, 2nd = 999, …)

Pass 2, retrieve all albums from pass 1:
• re-sum the sum(sales count) and weight, grouped by title
• order again by sum(sales count) in descending order
• take the top N
(a sketch of the two passes follows below)
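A sketch of the two passes over an assumed RDD of (provider, title, salesCount) records, with N and the weighting scheme taken from the slide:

// Assumed input: sales: RDD[(String, String, Long)] = (provider, title, salesCount)
val N = 10

// Pass 1: per provider, sum sales by title and keep a weighted top N
val pass1 = sales
  .map { case (provider, title, count) => ((provider, title), count) }
  .reduceByKey(_ + _)
  .map { case ((provider, title), sum) => (provider, (title, sum)) }
  .groupByKey()                                       // shuffle, one group per provider
  .flatMap { case (_, titles) =>
    titles.toSeq.sortBy(-_._2).take(N)                // top N titles per provider
      .zipWithIndex                                   // rank 0 = best seller
      .map { case ((title, sum), rank) => (title, (sum, 1000L - rank)) }
  }

// Pass 2: re-aggregate across providers, order by total sales, keep top N
val topK = pass1
  .reduceByKey { case ((s1, w1), (s2, w2)) => (s1 + s2, w1 + w2) }
  .sortBy({ case (_, (sum, _)) => sum }, ascending = false)
  .take(N)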

Page 33: Spark cassandra integration 2016

TopK query

Target data set = 3.2 billion rows:
• minimum filter = 1 month (period_end_month = 201404, for example)
• worst filter = a 3-month range
• + 8 other dynamic filters (music provider, distribution type, …)

☞ SASI indices for filtering
☞ Spark for aggregation
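In connector terms, the SASI filters can be pushed down with .where() before Spark does the aggregation; a sketch with assumed table and column names:

import com.datastax.spark.connector._

// Server-side filtering through the SASI-indexed columns,
// then only the surviving rows reach the Spark aggregation
val filtered = sc.cassandraTable("music", "album_sales")
  .where("period_end_month = ?", 201404)
  .where("music_provider = ?", "provider_1")   // hypothetical dynamic filter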

Page 34: Spark cassandra integration 2016

TopK query results

3.2 billion rows in total
• random distribution over 3 years (36 months) → ≈ 88 million rows/month

Filters                          | #rows       | Duration            | #rows/sec
3 months                         | 376 947 612 | 14 mins (840 secs)  | 448 747
1 month                          | 94 239 127  | 6.1 mins (366 secs) | 257 483
1 month + 1 provider             | 7 267 983   | 2.1 mins (126 secs) | 57 682
1 month + 1 provider + 1 country | 2 737 178   | 1.5 mins (90 secs)  | 30 413

Page 35: Spark cassandra integration 2016

Q & A

Page 36: Spark cassandra integration 2016


@doanduyhai

[email protected]

https://academy.datastax.com/

Thank You