breakthrough olap performance with cassandra and spark
![Page 1: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/1.jpg)
Breakthrough OLAP Performance with Cassandra and Spark
Evan Chan
August 2015
![Page 2: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/2.jpg)
Who am I?
- Distinguished Engineer, Tuplejump
- @evanfchan
- http://velvia.github.io
- User and contributor to Spark since 0.9, Cassandra since 0.6
- Co-creator and maintainer of Spark Job Server
![Page 3: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/3.jpg)
About Tuplejump
Tuplejump is a big data technology leader providing solutions for rapid insights from data.
- Calliope - the first Spark-Cassandra integration
- Stargate - an open source Lucene indexer for Cassandra
- SnackFS - open source HDFS for Cassandra
![Page 4: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/4.jpg)
Didn't I attend the same talk last year?
- Similar title, but mostly new material
- Will reveal new open source projects! :)
![Page 5: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/5.jpg)
Problem Space
- Need analytical database / queries on structured big data
  - Something SQL-like, very flexible and fast
  - Pre-aggregation too limiting
- Fast data / constant updates
  - Ideally, want my queries to run over fresh data too
![Page 6: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/6.jpg)
Example: Video analytics
- Typical collection and analysis of consumer events
- 3 billion new events every day
- Video publishers want updated stats, the sooner the better
- Pre-aggregation only enables simple dashboard UIs
- What if one wants to offer more advanced analysis, or a generic data query API?
  - E.g., top countries filtered by device type, OS, browser
![Page 7: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/7.jpg)
Requirements
- Scalable - rules out PostgreSQL, etc.
- Easy to update and ingest new data
  - Not traditional OLAP cubes - that's not what I'm talking about
- Very fast for analytical queries - OLAP not OLTP
- Extremely flexible queries
- Preferably open source
![Page 8: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/8.jpg)
Parquet
- Widely used, lots of support (Spark, Impala, etc.)
- Problem: Parquet is read-optimized, not easy to use for writes
  - Cannot support idempotent writes
  - Optimized for writing very large chunks, not small updates
  - Not suitable for time series, IoT, etc.
  - Often needs multiple passes of jobs for compaction of small files, deduplication, etc.

People really want a database-like abstraction, not a file format!
![Page 9: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/9.jpg)
Turns out this has been solved before!
Even Facebook uses Vertica.
![Page 10: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/10.jpg)
MPP Databases
- Easy writes plus fast queries, with constant transfers
- Automatic query optimization by storing intermediate query projections
- Stonebraker, et al. - CStore paper (Brown Univ)
![Page 11: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/11.jpg)
What's wrong with MPP Databases?
- Closed source
- $$$
- Usually don't scale horizontally that well (or cost is prohibitive)
![Page 12: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/12.jpg)
Cassandra
- Horizontally scalable
- Very flexible data modelling (lists, sets, custom data types)
- Easy to operate
- Perfect for ingestion of real time / machine data
- Best of breed storage technology, huge community
- BUT: Simple queries only
- OLTP-oriented
![Page 13: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/13.jpg)
Apache Spark
- Horizontally scalable, in-memory queries
- Functional Scala transforms - map, filter, groupBy, sort, etc.
- SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy!
- Huge number of connectors with every single storage technology
![Page 14: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/14.jpg)
Spark provides the missing fast, deep analytics piece of Cassandra!
![Page 15: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/15.jpg)
Spark and Cassandra OLAP Architectures
![Page 16: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/16.jpg)
Separate Storage and Query Layers
- Combine best of breed storage and query platforms
- Take full advantage of evolution of each
- Storage handles replication for availability
- Query can replicate data for scaling read concurrency - independent!
![Page 17: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/17.jpg)
Spark as Cassandra's Cache
![Page 18: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/18.jpg)
Spark SQL
- Appeared with Spark 1.0
- In-memory columnar store
- Parquet, JSON, Cassandra connector, Avro, many more
- SQL as well as DataFrames (Pandas-style) API
- Indexing integrated into data sources (e.g. C* secondary indexes)
- Write custom functions in Scala .... take that Hive UDFs!!
- Integrates well with MLBase, Scala/Java/Python
![Page 19: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/19.jpg)
Connecting Spark to Cassandra
- Datastax's Spark Cassandra Connector
- Tuplejump Calliope

Get started in one line with spark-shell!

    bin/spark-shell \
      --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3 \
      --conf spark.cassandra.connection.host=127.0.0.1
![Page 20: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/20.jpg)
Caching a SQL Table from Cassandra
DataFrames support in Cassandra Connector 1.4.0 (and 1.3.0):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val df = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "gdelt", "keyspace" -> "test"))
      .load()
    df.registerTempTable("gdelt")
    sqlContext.cacheTable("gdelt")
    sqlContext.sql("SELECT count(monthyear) FROM gdelt").show()

Spark does no caching by default - you will always be reading from C*!
![Page 21: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/21.jpg)
How Spark SQL's Table Caching Works
![Page 22: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/22.jpg)
Spark Cached Tables can be Really Fast
GDELT dataset, 4 million rows, 60 columns, localhost

| Method | secs |
|---|---|
| Uncached | 317 |
| Cached | 0.38 |

Almost a 1000x speedup (317 / 0.38 is roughly 830x)!

On an 8-node EC2 c3.XL cluster with 117 million rows, common queries run in 1-2 seconds against the cached dataset.
![Page 23: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/23.jpg)
Tuning Connector Partitioning
spark.cassandra.input.split.size

Guideline: One split per partition, one partition per CPU core

Much more parallelism won't speed up job much, but will starve other C* requests
![Page 24: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/24.jpg)
Lesson #1: Take Advantage of Spark Caching!
![Page 25: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/25.jpg)
Problems with Cached Tables
- Still have to read the data from Cassandra first, which is slow
- Amount of RAM: your entire data + extra for conversion to cached table
- Cached tables only live in Spark executors - by default
  - tied to single context - not HA
  - once any executor dies, must re-read data from C*
- Caching takes time: convert from RDD[Row] to compressed columnar format
- Cannot easily combine new RDD[Row] with cached tables (and keep speed)
![Page 26: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/26.jpg)
Problems with Cached Tables
If you don't have enough RAM, Spark can cache your tables partly to disk. This is still way, way faster than scanning an entire C* table. However, cached tables are still tied to a single Spark context/application.

Also: rdd.cache() is NOT the same as SQLContext's cacheTable!
![Page 27: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/27.jpg)
What about C* Secondary Indexing?
Spark-Cassandra Connector and Calliope can both reduce I/O by using Cassandra secondary indices. Does this work with caching?

No, not really, because only the filtered rows would be cached. Subsequent queries against this limited cached table would not give you expected results.
![Page 28: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/28.jpg)
Tachyon Off-Heap Caching
![Page 29: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/29.jpg)
Intro to Tachyon
- Tachyon: an in-memory cache for HDFS and other binary data sources
- Keeps data off-heap, so multiple Spark applications/executors can share data
- Solves HA problem for data
![Page 30: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/30.jpg)
Wait, wait, wait!
What am I caching exactly? Tachyon is designed for caching files or binary blobs.
- A serialized form of CassandraRow/CassandraRDD?
- Raw output from Cassandra driver?

What you really want is this:

Cassandra SSTable -> Tachyon (as row cache) -> CQL -> Spark
![Page 31: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/31.jpg)
Bad programmers worry about the code. Good programmers worry about data structures. - Linus Torvalds

Are we really thinking holistically about data modelling, caching, and how it affects the entire systems architecture?
![Page 32: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/32.jpg)
Efficient Columnar Storage in Cassandra
Wait, I thought Cassandra was columnar?
![Page 33: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/33.jpg)
How Cassandra stores your CQL Tables
Suppose you had this CQL table:

    CREATE TABLE (
      department text,
      empId text,
      first text,
      last text,
      age int,
      PRIMARY KEY (department, empId)
    );
![Page 34: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/34.jpg)
How Cassandra stores your CQL Tables

| PartitionKey | 01:first | 01:last | 01:age | 02:first | 02:last | 02:age |
|---|---|---|---|---|---|---|
| Sales | Bob | Jones | 34 | Susan | O'Connor | 40 |
| Engineering | Dilbert | P | ? | Dogbert | Dog | 1 |

Each row is stored contiguously. All columns in row 2 come after row 1.

To analyze only age, C* still has to read every field.
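The difference between the two layouts can be sketched in plain Scala. This is only an in-memory illustration, not how Cassandra lays out SSTables; the `Employee` and `EmployeeColumns` types and the function names are hypothetical, and only the two Sales rows from the table above are used.

```scala
object RowVsColumnar {
  // One logical row of the employee table from the slide (hypothetical type).
  final case class Employee(department: String, empId: String,
                            first: String, last: String, age: Int)

  // Row-oriented layout: all fields of a row are kept together.
  val rows: Vector[Employee] = Vector(
    Employee("Sales", "01", "Bob", "Jones", 34),
    Employee("Sales", "02", "Susan", "O'Connor", 40)
  )

  // Column-oriented layout: one array per column.
  final case class EmployeeColumns(first: Array[String],
                                   last: Array[String],
                                   age: Array[Int])

  def toColumns(rs: Seq[Employee]): EmployeeColumns =
    EmployeeColumns(rs.map(_.first).toArray,
                    rs.map(_.last).toArray,
                    rs.map(_.age).toArray)

  // Row scan: every whole row is visited just to pick out `age`.
  def sumAgeFromRows(rs: Seq[Employee]): Int = rs.map(_.age).sum

  // Column scan: only the `age` array is touched.
  def sumAgeFromColumns(cols: EmployeeColumns): Int = cols.age.sum

  def main(args: Array[String]): Unit = {
    println(sumAgeFromRows(rows))
    println(sumAgeFromColumns(toColumns(rows)))
  }
}
```

Both functions return the same total; the point is that the columnar version never reads `first` or `last`, which is the I/O saving the slides go on to describe.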
![Page 35: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/35.jpg)
Cassandra is really a row-based, OLTP-oriented datastore.
Unless you know how to use it otherwise :)
![Page 36: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/36.jpg)
The traditional row-based data storage approach is dead - Michael Stonebraker
![Page 37: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/37.jpg)
Columnar Storage (Memory)
Name column

| Index | 0 | 1 |
|---|---|---|
| Value | 0 | 1 |

Dictionary: {0: "Barack", 1: "Hillary"}

Age column

| Index | 0 | 1 |
|---|---|---|
| Value | 46 | 66 |
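A minimal sketch of the dictionary encoding shown on this slide, in plain Scala. The `DictColumn` type and the `encode`/`decode` names are illustrative assumptions, not Filo's or Spark's actual API.

```scala
object DictionaryEncoding {
  // A string column stored as small integer codes plus a lookup table.
  final case class DictColumn(dictionary: Vector[String], codes: Array[Int]) {
    // Random access: look the code up in the dictionary.
    def apply(i: Int): String = dictionary(codes(i))
  }

  // Assign codes in order of first appearance, then replace each value by its code.
  def encode(values: Seq[String]): DictColumn = {
    val dictionary = values.distinct.toVector
    val codeOf: Map[String, Int] = dictionary.zipWithIndex.toMap
    DictColumn(dictionary, values.map(codeOf).toArray)
  }

  // Expand the codes back into the original strings.
  def decode(col: DictColumn): Vector[String] =
    col.codes.toVector.map(col.dictionary)

  def main(args: Array[String]): Unit = {
    val col = encode(Seq("Barack", "Hillary", "Barack"))
    println(col.dictionary)
    println(col.codes.toList)
  }
}
```

With only two distinct names, each value costs one small integer regardless of how long the string is, which is why low-cardinality string columns compress so well.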
![Page 38: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/38.jpg)
Columnar Storage (Cassandra)
Review: each physical row in Cassandra (e.g. a "partition key") stores its columns together on disk.

Schema CF

| Rowkey | Type |
|---|---|
| Name | StringDict |
| Age | Int |

Data CF

| Rowkey | 0 | 1 |
|---|---|---|
| Name | 0 | 1 |
| Age | 46 | 66 |
![Page 39: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/39.jpg)
Columnar Format solves I/O
- Compression
  - Dictionary compression - HUGE savings for low-cardinality string columns
  - RLE, other techniques
- Reduce I/O
  - Only columns needed for query are loaded from disk
- Batch multiple rows in one cell for efficiency (avoid cluster key overhead)
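Run-length encoding (RLE), mentioned above, can be sketched as follows. This is a generic textbook version in plain Scala, offered as an illustration only; it is not claimed to be the encoding any particular columnar library uses, and the names are hypothetical.

```scala
object RunLengthEncoding {
  // Collapse consecutive repeats into (value, runLength) pairs.
  def encode[A](values: Seq[A]): Vector[(A, Int)] =
    values.foldLeft(Vector.empty[(A, Int)]) { (runs, v) =>
      runs.lastOption match {
        // Same value as the current run: extend it by one.
        case Some((last, n)) if last == v => runs.init :+ ((last, n + 1))
        // Different value (or first element): start a new run.
        case _ => runs :+ ((v, 1))
      }
    }

  // Expand each run back into its repeated values.
  def decode[A](runs: Seq[(A, Int)]): Vector[A] =
    runs.toVector.flatMap { case (v, n) => Vector.fill(n)(v) }

  def main(args: Array[String]): Unit = {
    val countries = Seq("US", "US", "US", "CA", "US")
    println(encode(countries))
    println(decode(encode(countries)))
  }
}
```

RLE pays off most when a column is sorted or clustered so that equal values sit next to each other, which is common once rows are batched into columnar chunks.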
![Page 40: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/40.jpg)
Columnar Format solves Caching
- Use the same format on disk, in cache, in memory scan
  - Caching works a lot better when the cached object is the same!!
- No data format dissonance means bringing in new bits of data and combining with existing cached data is seamless
![Page 41: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/41.jpg)
So, why isn't everybody doing this?
- No columnar storage format designed to work with NoSQL stores
- Efficient conversion to/from columnar format a hard problem
- Most infrastructure is still row oriented
  - Spark SQL/DataFrames based on RDD[Row]
  - Spark Catalyst is a row-oriented query parser
![Page 42: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/42.jpg)
All hard work leads to profit, but mere talk leads to poverty. - Proverbs 14:23
![Page 43: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/43.jpg)
![Page 44: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/44.jpg)
Columnar Storage Performance Study
http://github.com/velvia/cassandra-gdelt
![Page 45: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/45.jpg)
GDELT Dataset
Global Database of Events, Language, and Tone
- 1979 to now
- 60 columns, 250 million+ rows, 250GB+

Let's compare Cassandra I/O only, no caching or Spark
![Page 46: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/46.jpg)
The scenarios
1. Narrow table - CQL table with one row per partition key
2. Wide table - wide rows with 10,000 logical rows per partition key
3. Columnar layout - 1000 rows per columnar chunk, wide rows, with dictionary compression

First 4 million rows, localhost, SSD, C* 2.0.9, LZ4 compression. Compaction performed before read benchmarks.
![Page 47: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/47.jpg)
Query and ingest times

| Scenario | Ingest | Read all columns | Read one column |
|---|---|---|---|
| Narrow table | 1927 sec | 505 sec | 504 sec |
| Wide table | 3897 sec | 365 sec | 351 sec |
| Columnar | 93 sec | 8.6 sec | 0.23 sec |

On reads, using a columnar format is up to 2190x faster, while ingestion is 20-40x faster.

Of course, real life perf gains will depend heavily on query, table width, etc.
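The quoted speedups follow directly from the timings in the table. The short Scala sketch below just redoes that arithmetic; the `Timing` type and `speedup` helper are hypothetical names introduced for illustration.

```scala
object BenchmarkRatios {
  // One row of the timing table, all values in seconds.
  final case class Timing(scenario: String, ingestSec: Double,
                          readAllSec: Double, readOneSec: Double)

  val narrow   = Timing("Narrow table", 1927.0, 505.0, 504.0)
  val wide     = Timing("Wide table",   3897.0, 365.0, 351.0)
  val columnar = Timing("Columnar",       93.0,   8.6,   0.23)

  // How many times faster `fasterSec` is than `baselineSec`.
  def speedup(baselineSec: Double, fasterSec: Double): Double =
    baselineSec / fasterSec

  def main(args: Array[String]): Unit = {
    // Best case for reads: narrow table vs columnar, single column.
    println(speedup(narrow.readOneSec, columnar.readOneSec))
    // Ingest range: narrow and wide tables vs columnar.
    println(speedup(narrow.ingestSec, columnar.ingestSec))
    println(speedup(wide.ingestSec, columnar.ingestSec))
  }
}
```

The single-column read ratio (504 / 0.23) is where the "up to 2190x" figure comes from, and the two ingest ratios (1927 / 93 and 3897 / 93) bracket the "20-40x" claim.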
![Page 48: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/48.jpg)
Disk space usage

| Scenario | Disk used |
|---|---|
| Narrow table | 2.7 GB |
| Wide table | 1.6 GB |
| Columnar | 0.34 GB |

The disk space usage helps explain some of the numbers.
![Page 49: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/49.jpg)
Towards Extreme Query Performance
![Page 50: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/50.jpg)
The filo project (http://github.com/velvia/filo) is a binary data vector library designed for extreme read performance with minimal deserialization costs.
- Designed for NoSQL, not a file format
- random or linear access
- on or off heap
- missing value support
- Scala only, but cross-platform support possible
![Page 51: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/51.jpg)
What is the ceiling?
This Scala loop can read integers from a binary Filo blob at a rate of 2 billion integers per second - single threaded:

    def sumAllInts(): Int = {
      var total = 0
      for { i <- 0 until numValues optimized } {
        total += sc(i)
      }
      total
    }
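The slide's loop depends on names defined elsewhere (`numValues`, `sc`) and on `optimized`, which is not standard Scala and appears to come from a loop-rewriting macro library. As a self-contained stand-in, the sketch below sums integers straight out of a `java.nio.ByteBuffer` with a plain `while` loop. It assumes a simple packed layout of 4-byte integers, not Filo's actual binary format, and makes no claim about throughput.

```scala
import java.nio.ByteBuffer

object IntBlobSum {
  // Pack integers into a binary blob, 4 bytes each (hypothetical layout).
  def pack(values: Array[Int]): ByteBuffer = {
    val buf = ByteBuffer.allocate(values.length * 4)
    values.foreach(v => buf.putInt(v))
    buf.flip() // switch the buffer from writing to reading
    buf
  }

  // Sum every integer in the blob without materializing an Array[Int].
  def sumAllInts(blob: ByteBuffer): Int = {
    val base = blob.position()
    val numValues = blob.remaining() / 4
    var total = 0
    var i = 0
    while (i < numValues) {
      // Absolute read: does not move the buffer's position.
      total += blob.getInt(base + i * 4)
      i += 1
    }
    total
  }

  def main(args: Array[String]): Unit =
    println(sumAllInts(pack(Array(1, 2, 3, 4))))
}
```

Reading values in place like this, instead of deserializing into objects first, is the "minimal deserialization" idea the filo slide describes.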
![Page 52: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/52.jpg)
Vectorization of Spark Queries
The Tungsten project.

Process many elements from the same column at once, keep data in L1/L2 cache.

Coming in Spark 1.4 through 1.6
![Page 53: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/53.jpg)
Hot Column Caching in Tachyon
- Has a "table" feature, originally designed for Shark
- Keep hot columnar chunks in shared off-heap memory for fast access
![Page 54: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/54.jpg)
Introducing FiloDB
http://github.com/velvia/FiloDB
![Page 55: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/55.jpg)
What's in the name?
Rich sweet layers of distributed, versioned database goodness
![Page 56: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/56.jpg)
Distributed
Apache Cassandra. Scale out with no SPOF. Cross-datacenter replication. Proven storage and database technology.
![Page 57: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/57.jpg)
Versioned
Incrementally add a column or a few rows as a new version. Easily control what versions to query. Roll back changes inexpensively.

Stream out new versions as continuous queries :)
![Page 58: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/58.jpg)
Columnar
- Parquet-style storage layout
- Retrieve select columns and minimize I/O for OLAP queries
- Add a new column without having to copy the whole table
- Vectorization and lazy/zero serialization for extreme efficiency
![Page 59: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/59.jpg)
100% Reactive
Built completely on the Typesafe Platform:
- Scala 2.10 and SBT
- Spark (including custom data source)
- Akka Actors for rational scale-out concurrency
- Futures for I/O
- Phantom Cassandra client for reactive, type-safe C* I/O
- Typesafe Config
![Page 60: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/60.jpg)
Spark SQL Queries!

    SELECT first, last, age FROM customers
      WHERE _version > 3 AND age < 40 LIMIT 100

- Read from and write to Spark DataFrames
- Append/merge to FiloDB table from Spark Streaming
![Page 61: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/61.jpg)
FiloDB vs Parquet
- Comparable read performance - with lots of space to improve
  - Assuming co-located Spark and Cassandra
  - On localhost, both subsecond for simple queries (GDELT 1979-1984)
  - FiloDB has more room to grow - due to hot column caching and much less deserialization overhead
- Lower memory requirement due to much smaller block sizes
- Much better fit for IoT / Machine / Time-series applications
- Limited support for types
  - array / set / map support not there, but will be added later
![Page 62: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/62.jpg)
Where FiloDB Fits In
- Use regular C* denormalized tables for OLTP and single-key lookups
- Use FiloDB for the remaining ad-hoc or more complex analytical queries
- Simplify your analytics infrastructure!
  - No need to export to Hadoop/Parquet/data warehouse. Use Spark and C* for both OLAP and OLTP!
- Perform ad-hoc OLAP analysis of your time-series, IoT data
![Page 63: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/63.jpg)
Simplify your Lambda Architecture...
(https://www.mapr.com/developercentral/lambda-architecture)
![Page 64: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/64.jpg)
With Spark, Cassandra, and FiloDB
- Ma, where did all the components go?
- You mean I don't have to deal with Hadoop?
- Use Cassandra as a front end to store IoT data first
![Page 65: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/65.jpg)
Exactly-Once Ingestion from Kafka
- New rows appended via Kafka
- Writes are idempotent - no need to dedup!
- Converted to columnar chunks on ingest and stored in C*
- Only necessary columnar chunks are read into Spark for minimal I/O
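Why idempotent writes remove the need for deduplication can be shown with a small in-memory model. This is a conceptual sketch only and not FiloDB's ingestion code: the `Event` type, the choice of `(partition, offset)` as the key, and the `ingest` function are all assumptions made for illustration.

```scala
object IdempotentIngest {
  // A record read from a log, identified by where it came from (hypothetical).
  final case class Event(partition: String, offset: Long, payload: String)

  type Store = Map[(String, Long), Event]

  // Upsert keyed by (partition, offset): writing the same event again
  // overwrites the existing entry instead of adding a duplicate.
  def ingest(store: Store, batch: Seq[Event]): Store =
    batch.foldLeft(store) { (acc, e) =>
      acc.updated((e.partition, e.offset), e)
    }

  def main(args: Array[String]): Unit = {
    val batch = Seq(Event("p0", 0L, "a"), Event("p0", 1L, "b"))
    val once  = ingest(Map.empty, batch)
    val twice = ingest(once, batch) // simulate a replay after a failure
    println(once.size)
    println(once == twice)
  }
}
```

Because a replayed batch leaves the store unchanged, a consumer can safely re-read from its last committed position after a crash, which is what turns at-least-once delivery into effectively exactly-once results.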
![Page 66: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/66.jpg)
You can help!
- Send me your use cases for OLAP on Cassandra and Spark
  - Especially IoT and Geospatial
- Email if you want to contribute
![Page 67: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/67.jpg)
Thanks...
to the entire OSS community, but in particular:
- Lee Mighdoll, Nest/Google
- Rohit Rai and Satya B., Tuplejump
- My colleagues at Socrata

If you want to go fast, go alone. If you want to go far, go together. - African proverb
![Page 68: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/68.jpg)
DEMO TIME
GDELT: Regular C* Tables vs FiloDB
![Page 69: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/69.jpg)
Extra Slides
![Page 70: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/70.jpg)
When in doubt, use brute force - Ken Thompson
![Page 71: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/71.jpg)
Automatic Columnar Conversion using Custom Indexes
- Write to Cassandra as you normally do
- Custom indexer takes changes, merges and compacts into columnar chunks behind scenes
![Page 72: Breakthrough OLAP performance with Cassandra and Spark](https://reader033.vdocuments.net/reader033/viewer/2022052401/55d6d617bb61ebc60b8b461b/html5/thumbnails/72.jpg)
Implementing Lambda is Hard
- Use real-time pipeline backed by a KV store for new updates
- Lots of moving parts
  - Key-value store, real time sys, batch, etc.
- Need to run similar code in two places
- Still need to deal with ingesting data to Parquet/HDFS
- Need to reconcile queries against two different places