Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)


Upload: richard-low

Post on 06-May-2015


DESCRIPTION

Everything Cassandra does is designed for a real-time workload of high-volume inserts and frequent small queries. Cassandra has Hadoop and Hive integration, but running long ad-hoc queries with these tools either impacts real-time performance or requires a duplicate cluster. This talk will explain how I'm integrating Cassandra with Shark, a drop-in Hive replacement developed by Berkeley's AMPLab. The integration is designed to give fine-grained control over all resource usage so you can safely run arbitrary ad-hoc queries on your existing cluster with controlled and predictable impact.

TRANSCRIPT

Page 1: Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)

#CASSANDRAEU CASSANDRASUMMITEU

Richard Low | @richardalow

Mixing Batch and Real-time: Cassandra with Shark

Page 2

About me

* Analytics tech lead at SwiftKey
* Cassandra freelancer
* Previous: lead Cassandra and Analytics dev at Acunu

Page 3

Outline

* Batch analytics on real-time databases
* Current solutions
* Spark and Shark
* My solution
* Performance results
* Summary & future work

Page 4

Batch analytics on real-time databases

Page 5

Batch and real-time analytics

* Wherever there is data there are unforeseeable queries
* Real-time databases are optimized for real-time queries
* Large queries may not be possible
* Or will impact your real-time SLA

Page 6

Example

* User accounts database
* Read-heavy
* Must be low latency
* Other tables on same database
* Some are write-heavy
* A good fit for Cassandra!

Page 7

Example data model

CREATE TABLE user_accounts (
  userid uuid PRIMARY KEY,
  username text,
  email text,
  password text,
  last_visited timestamp,
  country text
);

Page 8

Example data model

SELECT * FROM user_accounts LIMIT 2;

 userid   | country | email             | last_visited        | password | username
----------+---------+-------------------+---------------------+----------+----------
 a03dcf03 | UK      | [email protected] | 2013-10-07 09:07:36 | td7rjxwp | rlow
 b3f1871e | FR      | [email protected] | 2013-08-17 13:07:36 | moh7eksn | jean88

Page 9

Marketing walks in

Page 10

Ad-hoc query

“Please can you find all users from Brazil who haven’t logged in since July and have an email @yahoo.com.

I need the answer by Monday.”

Page 11

Ad-hoc query observations

* We have 500k users from Brazil
* 60MB of raw data
* No way to extract by country from data model
* It’s on unchanging data (mostly: some of the people who haven’t visited for a while may suddenly come back)
* Can take hours, not days
* No expectation this query will need rerunning
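Because the data model is keyed by userid only, the marketing request comes down to a full scan with a three-part predicate. A minimal Python sketch of that predicate follows, assuming rows shaped like the user_accounts table; the `UserAccount` class, the `marketing_matches` function and the 2013-08-01 cutoff are illustrative assumptions, not code from the talk.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Iterator


@dataclass
class UserAccount:
    """Hypothetical in-memory mirror of one user_accounts row."""
    userid: str
    username: str
    email: str
    password: str
    last_visited: datetime
    country: str


def marketing_matches(rows: Iterable[UserAccount],
                      country: str = "BR",
                      cutoff: datetime = datetime(2013, 8, 1),
                      domain: str = "@yahoo.com") -> Iterator[UserAccount]:
    """Scan every row and keep those matching all three conditions.

    The cutoff date is an assumed reading of "haven't logged in since July".
    """
    for row in rows:
        if (row.country == country
                and row.last_visited < cutoff
                and row.email.endswith(domain)):
            yield row


# Hypothetical sample rows, for illustration only.
sample_rows = [
    UserAccount("u1", "ana", "ana@yahoo.com", "x", datetime(2013, 6, 2), "BR"),
    UserAccount("u2", "bia", "bia@yahoo.com", "x", datetime(2013, 9, 9), "BR"),
    UserAccount("u3", "tom", "tom@yahoo.com", "x", datetime(2013, 5, 1), "UK"),
    UserAccount("u4", "rui", "rui@example.org", "x", datetime(2013, 4, 1), "BR"),
]
matched_ids = [row.userid for row in marketing_matches(sample_rows)]
```

Every row is visited once, which is why the cost of the query is dominated by how the scan reads the data rather than by the predicate itself.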

Page 12

Why?

* Underrepresented use case in plethora of tools
* Seen days of dev time wasted
* Want to see what can be done

Page 13

Current solutions

Page 14

Options

* Run Hive query on top of Cassandra

Page 15

Options

* Run Hive query on top of Cassandra
  * Will compete with Cassandra for I/O, memory, CPU and network
  * Will cause extra GC pressure on Cassandra
  * Could flush filesystem cache

Page 16

Options

* Write ETL script and load into another DB

Page 17

Options

* Write ETL script and load into another DB
  * All custom code
  * Single threaded
  * Unreliable
  * Will still flush cache on Cassandra nodes

Page 18

Options

* Clone the cluster

Page 19

Options

* Clone the cluster
  * Worst possible network load
  * Manual import each time
  * No incremental update
  * Need duplicate hardware

Page 20

Options

* Add ‘batch analytics’ DC and run Hive there

Page 21

Options

* Add ‘batch analytics’ DC and run Hive there
  * Initial copy slow and affects real-time performance
  * Need duplicate hardware
  * Will drop writes when really busy

Page 22

Spark and Shark

Page 23

Spark

* Developed by AMPLab
* Distributed computation, like Hadoop
* Designed for iterative algorithms
* Much faster for queries with working sets that fit in RAM
* Reliability from storing lineage rather than intermediate results
* Runs on Mesos or YARN
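The lineage idea can be illustrated with a toy sketch: each dataset remembers its parent and the transformation that produced it, so a lost result is rebuilt by replaying those steps instead of being restored from a stored copy. The `Dataset` class and its methods below are hypothetical and are not Spark's API.

```python
from typing import Callable, List, Optional


class Dataset:
    """Toy dataset that records how it was derived (its lineage)."""

    def __init__(self,
                 source: Optional[Callable[[], List[int]]] = None,
                 parent: Optional["Dataset"] = None,
                 transform: Optional[Callable[[List[int]], List[int]]] = None):
        self.source = source        # how to load the base data
        self.parent = parent        # dataset this one was derived from
        self.transform = transform  # step applied to the parent's rows
        self.cache: Optional[List[int]] = None

    def map(self, fn: Callable[[int], int]) -> "Dataset":
        return Dataset(parent=self,
                       transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "Dataset":
        return Dataset(parent=self,
                       transform=lambda rows: [r for r in rows if pred(r)])

    def compute(self) -> List[int]:
        """Return cached rows, or rebuild them by replaying the lineage."""
        if self.cache is not None:
            return self.cache
        if self.parent is None:
            rows = self.source()
        else:
            rows = self.transform(self.parent.compute())
        self.cache = rows
        return rows

    def lose_cache(self) -> None:
        """Simulate a worker failure that discards the computed rows."""
        self.cache = None


base = Dataset(source=lambda: list(range(10)))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.compute()
evens_squared.lose_cache()        # the result is gone...
second = evens_squared.compute()  # ...and is recomputed from lineage
```

Only the recipe is kept durable here; the intermediate rows are treated as disposable, which is the trade-off the slide describes.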

Page 24

Spark is used by

Source: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

Page 25

Shark

* Hive on Spark
* Completely compatible with Hive
* Same QL, UDFs and storage handlers
* Can cache tables

Page 26

Shark

* Hive on Spark
* Completely compatible with Hive
* Same QL, UDFs and storage handlers
* Can cache tables

CREATE TABLE user_accounts_cached AS
  SELECT * FROM user_accounts
  WHERE country = 'BR';

Page 27

Shark on Cassandra

Page 28

Shark on Cassandra

* CqlStorageHandler
* Can use existing hive-cassandra storage handler
* Can work well - see Evan Chan’s talk (Ooyala) from #cassandra13
* But suffers from same problems as Hive+Hadoop on Cassandra

Page 29

Shark on Cassandra direct

* SSTableStorageHandler
* Run Spark workers on the Cassandra nodes
* Read directly from SSTables in separate JVM
* Limit CPU and memory through Spark/Mesos/YARN
* Limit I/O by rate limiting raw disk access
* Skip filesystem cache
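One common way to rate limit raw reads is a token bucket placed in front of every chunk read. The sketch below is a generic illustration of that technique in Python, not the SSTableStorageHandler's implementation; `TokenBucket`, `throttled_read` and their parameters are hypothetical names.

```python
import time


class TokenBucket:
    """Allows at most rate_bytes_per_s on average, with bursts up to capacity."""

    def __init__(self, rate_bytes_per_s, capacity_bytes,
                 clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate_bytes_per_s)
        self.capacity = float(capacity_bytes)
        self.tokens = self.capacity   # start with a full bucket
        self.clock = clock            # injectable so the limiter can be tested
        self.sleep = sleep
        self.last = clock()

    def acquire(self, nbytes):
        """Block until nbytes tokens are available, then consume them."""
        if nbytes > self.capacity:
            raise ValueError("request larger than bucket capacity")
        while True:
            now = self.clock()
            # Refill in proportion to elapsed time, never above capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            # Wait just long enough for the missing tokens to accumulate.
            self.sleep((nbytes - self.tokens) / self.rate)


def throttled_read(stream, bucket, chunk_size=64 * 1024):
    """Yield chunks from a binary stream, paying tokens before each read."""
    while True:
        bucket.acquire(chunk_size)
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk
```

A reader would wrap a file opened in binary mode, for example `throttled_read(open(path, "rb"), TokenBucket(10 * 1024 * 1024, 64 * 1024))`, where the 10 MB/s figure is an arbitrary example value. Lowering the rate trades scan time for less interference with the real-time workload.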

Page 30

Cassandra on Spark: through CQL interface

[Diagram. Labels: Cassandra JVM (Deserialize, Merge, Serialize); Spark worker JVM (Deserialize, Process); Remote client; FS Cache; SSTables. Annotation: Latency spikes!]

Page 31

Cassandra on Spark: SSTables direct

[Diagram. Labels: Cassandra JVM (Deserialize, Merge, Serialize); Spark worker JVM (Deserialize, Process); Remote client; SSTables; FS Cache. Annotation: Constant latency]

Page 32

Disadvantages

* Equivalent to CL.ONE
* Always runs task local with the data
* Doesn’t read data in memtables

Page 33

Performance results

Page 34

Testing

* 4 node Cassandra cluster on m1.large
  * 2 cores, 7.5 GB RAM, 2 ephemeral disks
* 1 Spark master
* Spark running on Cassandra nodes
  * Limited to 1 core, 1 GB RAM
* Compare CQLStorageHandler with SSTableStorageHandler

Page 35

Setup

* Cassandra 1.2.10
* 3 GB heap
* 256 tokens per node
* RF 3
* Preloaded 100M randomly generated records
  * Each node started with 9GB of data
* No optimization or tuning
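The setup numbers can be cross-checked with a little arithmetic. The sketch below assumes data is spread evenly across the four nodes and that 9GB means 9 × 10^9 bytes; both are assumptions, and the variable names are illustrative.

```python
records = 100_000_000        # preloaded records
replication_factor = 3       # RF 3
nodes = 4                    # 4 node cluster
data_per_node_bytes = 9e9    # "9GB", assumed to be decimal gigabytes

# With RF 3 on 4 nodes, each node holds 3/4 of all records.
replicas_per_node = records * replication_factor / nodes

# Implied on-disk size of one record.
bytes_per_record = data_per_node_bytes / replicas_per_node

# The ad-hoc query slide quotes 60MB of raw data for 500k Brazilian users.
brazil_bytes_per_record = 60e6 / 500_000

print(f"{replicas_per_node:,.0f} replica records per node")
print(f"{bytes_per_record:.0f} bytes per record on disk")
print(f"{brazil_bytes_per_record:.0f} bytes per record in the 60MB estimate")
```

Under these assumptions both figures come out at about 120 bytes per record, so the 9GB per node and the 60MB estimate are consistent with each other.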

Page 36

Tools

* codahale Metrics
* Ganglia
* Load generator using DataStax Java driver
* Google spreadsheet

Page 37

Result 1

* No Cassandra load
* Run caching query:

CREATE TABLE user_accounts_cached AS
  SELECT * FROM user_accounts
  WHERE country = 'BR';

  * Takes 33 mins through CQL
  * Takes 13 mins through SSTables
    * 130k records/s
* => SSTables is 2.5x faster
* Even better since CQL has access to both cores
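The throughput and speed-up figures follow from the run times, assuming the caching query scans all 100M preloaded records (the table has no index on country, so a full scan is the natural reading). The variable names are illustrative.

```python
records = 100_000_000   # preloaded records, assumed to be fully scanned
cql_minutes = 33        # run time through CQL
sstable_minutes = 13    # run time reading SSTables directly

sstable_rate = records / (sstable_minutes * 60)   # records per second
cql_rate = records / (cql_minutes * 60)           # records per second
speedup = cql_minutes / sstable_minutes

print(f"SSTables: {sstable_rate:,.0f} records/s")
print(f"CQL:      {cql_rate:,.0f} records/s")
print(f"Speed-up: {speedup:.2f}x")
```

This gives roughly 128k records/s for the SSTable path, which matches the quoted 130k records/s after rounding, and a ratio of about 2.5.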

Page 38

Using cached results

* Now have results cached, can run super fast queries
  * No I/O or extra memory
  * Bounded number of cores
* Took 18 seconds

SELECT count(*) FROM user_accounts_cached
WHERE unix_timestamp(last_visited) < unix_timestamp('2013-08-01 00:00:00')
  AND email LIKE '%@c9%';

Page 39

Result 2

* Add read load
  * Read-modify-write of accounts info
  * 200 ops/s
  * Measure latency
* Slow down SSTable loader to same rate as CQL

Page 40

[Latency chart; surviving labels: 95%ile, mean, base]

Page 41

Analysis

* Average latency 17% lower
  * Probably due to less CPU used by query
* Max 95th %ile latency 33% lower and much more predictable
  * Possibly due to less GC pressure
* Still have a latency increase over base
  * Probably due to I/O use
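Comparisons like "17% lower" and "33% lower" are relative reductions of a mean or a 95th percentile. The sketch below shows one way to compute them using the nearest-rank percentile definition; the helper names are hypothetical and the latency samples are invented for illustration, not measurements from the talk.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def relative_reduction(baseline, candidate):
    """Fraction by which candidate is lower than baseline (0.17 means 17% lower)."""
    return (baseline - candidate) / baseline


# Invented latency samples in milliseconds, for illustration only.
loader_a_ms = [4, 5, 5, 6, 7, 8, 9, 12, 20, 45]
loader_b_ms = [4, 4, 5, 5, 6, 7, 8, 10, 14, 30]

mean_change = relative_reduction(mean(loader_a_ms), mean(loader_b_ms))
p95_change = relative_reduction(percentile(loader_a_ms, 95),
                                percentile(loader_b_ms, 95))
```

The slide's "max 95th %ile" would be the largest such percentile across successive reporting intervals, which needs per-interval samples rather than one pooled list.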

Page 42

Result 3

* Keep read workload
* Measure same latency
* Add insert workload
  * Insert into separate table
  * 2500 ops/s

Page 43

[Charts; surviving panel titles: CQL loader, SSTable loader]

Page 44

Analysis

* Lots of latency, but there is anyway

Page 45

Performance wrap up

* 2.5x faster with less CPU
  => uses fewer resources to do the same thing
* Lower, more predictable latencies when at same speed
  => controlled resource usage lowers latency impact
* Could limit further to make impact unnoticeable

Page 46

Summary

Page 47

Summary

* Discussed analytics use case not well served by current tools
* Spark, Shark
* SSTableStorageHandler
* Performance results

Page 48

Future

* Needs a name
* GitHub
* Speak to me if you want to use it
* Speak to me if you want to contribute

Page 49

Thank you!

Richard Low | @richardalow