C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard
DESCRIPTION
Building analytics systems is an increasingly common requirement for BI teams inside companies both big and small, and it becomes even more challenging when analytic results have to be produced in real time. In this presentation, the team from MarkedUp Analytics will show you techniques for leveraging Cassandra, Hadoop, and Hive to build a manageable and scalable analytics system capable of handling a wide range of business cases and needs.
TRANSCRIPT
![Page 1: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/1.jpg)
Real Time Analytics with Cassandra, Hive, and Solr
![Page 2: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/2.jpg)
Real Time Analytics with Cassandra, Hive, and Solr Aaron Stannard, Founder & CEO of MarkedUp
![Page 3: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/3.jpg)
Powerful analytics tools for native apps
Understand your audience.
Gain valuable data on your users.
Monitor your app’s health.
Log errors and crashes remotely.
Drive more sales.
Better data = more revenue.
![Page 4: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/4.jpg)
![Page 5: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/5.jpg)
Do we really need real-time analytics?
![Page 6: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/6.jpg)
![Page 7: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/7.jpg)
Real time analytics isn’t inherently superior or necessary.
![Page 8: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/8.jpg)
![Page 9: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/9.jpg)
Building your own real-time analytics service with Cassandra and DataStax Enterprise
![Page 10: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/10.jpg)
Cassandra Setup on EC2
![Page 11: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/11.jpg)
Write Strategy
![Page 12: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/12.jpg)
Read Strategy
![Page 13: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/13.jpg)
Analytics Schema Strategy
• All row keys should be predictable (not always possible)
• Utilize physical sortability of columns
• Use predictably sortable data types for column names (integers, dates)
• Learn to love composite keys
• Batch mutations are your friend
• Use distributed counters for real-time metrics
• Use TTL for automatic data expiration (if necessary)
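The bullets above can be illustrated with a small in-memory sketch. It is not the schema from the talk (the slides for those schemas survive only as images) and it does not call any Cassandra driver. The key layout `app:metric:YYYYMMDD`, the hour-of-day column names, and the helper names `row_key`, `column_name`, `record_event`, and `read_day` are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def row_key(app_id, metric, ts):
    """Predictable row key (hypothetical layout): a reader who knows the app,
    the metric, and the day can rebuild the key without a lookup."""
    return f"{app_id}:{metric}:{ts:%Y%m%d}"


def column_name(ts):
    """Sortable column name: the hour of the day as an integer."""
    return ts.hour


# In-memory stand-in for a counter column family: row key -> {column -> count}.
counters = defaultdict(lambda: defaultdict(int))


def record_event(app_id, metric, ts):
    """Increment the counter for the hour in which the event occurred."""
    counters[row_key(app_id, metric, ts)][column_name(ts)] += 1


def read_day(app_id, metric, day):
    """Return (hour, count) pairs for one day, in column order."""
    row = counters.get(row_key(app_id, metric, day), {})
    return sorted(row.items())


if __name__ == "__main__":
    noon = datetime(2013, 6, 11, 12, 30, tzinfo=timezone.utc)
    record_event("demo-app", "sessions", noon)
    print(read_day("demo-app", "sessions", noon))
```

Because the row key is derived from values the reader already has, a day's worth of hourly counts is a single row read, and integer column names keep the hours in order.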
![Page 14: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/14.jpg)
Time Series Schema 0: All Knowns
![Page 15: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/15.jpg)
Time Series Schema 1: Bounded Number of Unknowns
![Page 16: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/16.jpg)
Time Series Schema 2: Unbounded Number of Unknowns
![Page 17: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/17.jpg)
Schema Tips
![Page 18: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/18.jpg)
Adding Hive and Hadoop to the Mix
Mo’ data, mo’ problems
![Page 19: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/19.jpg)
When is Hadoop necessary?
• Large volumes of data (100GB+)
• Queries require retrospective / historical analysis
• Need consistent results
• Need to perform multi-stage analysis
• Speed isn’t a concern (Hadoop is sloooooooooow)
![Page 20: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/20.jpg)
Hadoop on easy mode: Hive
• SQL abstraction on top of Hadoop (more familiar)
• Easier to deploy and test
• Simplifies data warehousing
• Easy to automatically import data from Cassandra
• DSE eliminates need for HDFS
![Page 21: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/21.jpg)
C* to Hive
![Page 22: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/22.jpg)
Hive Syntax
Query: count the number of items where “key” is greater than 100
RDBMS> select key, count(1) from kv1 where key > 100 group by key;
Hive> select key, count(1) from kv1 where key > 100 group by key;
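To see what this query returns, the sketch below runs the same statement against an in-memory SQLite database, standing in for the "RDBMS" side of the comparison. The table layout (an integer `key` and a text `value`) and the sample rows are assumptions; the slide does not show the contents of `kv1`.

```python
import sqlite3

# In-memory database used purely as a stand-in RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv1 (key INTEGER, value TEXT)")  # assumed layout
conn.executemany(
    "INSERT INTO kv1 VALUES (?, ?)",
    [(50, "a"), (150, "b"), (150, "c"), (200, "d")],  # made-up sample rows
)

# The query text is taken verbatim from the slide.
rows = conn.execute(
    "select key, count(1) from kv1 where key > 100 group by key"
).fetchall()

# GROUP BY does not guarantee an output order, so sort before displaying.
result = sorted(rows)
print(result)
conn.close()
```

The statement yields one row per distinct key above 100, paired with the number of rows that share that key.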
![Page 23: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/23.jpg)
Hive Tips and Tricks
• Don’t write data from Hive back to a hot Cassandra column family
• If writing data from Hive to Cassandra, use dedicated column families
• You can write to multiple places on a single Hive read (table, CSV file, etc…)
• Use sampling to test Hive queries on scaled-down data sets
![Page 24: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/24.jpg)
How do you count millions of distinct items in real-time?
![Page 25: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/25.jpg)
• Solr: Lucene-based indexing engine
• Part of Apache Foundation
• Full-text search
• Faceted search
• Distributed
• Integrates well with Cassandra
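One plausible reading of how faceted search relates to the distinct-count question is that a facet on a field returns one bucket per distinct value, so the number of buckets is the distinct count. The sketch below models that idea in plain Python. It does not use the Solr API, and the `device_id` field, the sample documents, and the `facet_counts` helper are hypothetical.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count documents per distinct value of `field`, similar in spirit to a
    facet query; documents that lack the field are skipped."""
    return Counter(doc[field] for doc in docs if field in doc)


# Made-up event documents; a real index would hold millions of these.
events = [
    {"device_id": "d1", "event": "launch"},
    {"device_id": "d2", "event": "launch"},
    {"device_id": "d1", "event": "crash"},
    {"event": "launch"},  # no device_id, so it is not counted
]

facets = facet_counts(events, "device_id")
distinct_devices = len(facets)  # one bucket per distinct value
print(distinct_devices, dict(facets))
```

The appeal of pushing this into a search index is that the per-value buckets are maintained as documents arrive, rather than recomputed by scanning raw events for each query.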
![Page 26: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/26.jpg)
Solr Index Setup
![Page 27: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard](https://reader033.vdocuments.net/reader033/viewer/2022051817/547c50ed5906b57c798b472e/html5/thumbnails/27.jpg)
Solr Search