Capital One: Using Cassandra in Building a Reporting Platform
Javed Roshan – Director, Data Services
Mukaram Aziz – Sr. Manager, Data Services
1 Use Case
2 New Data Platform
3 Design Decisions
4 Solution Stack
5 Challenges
© 2015. All Rights Reserved.
Use Case
• Fast Data requirements in an Operational Space
  – Metrics and Reports for intra-day business decisions
  – Process Monitoring
• Current Landscape
  – Multiple data sources
  – Traditional batched ETL
  – Multiple data destinations
  – Reporting Tools
• Opportunity Areas
  – Make reports near real time
  – Achieve 99.99% SLAs
  – Time to market delivery
  – Make enhancements inexpensive
Existing
Data Sources       RDBMS, Files
ETL                File based
Data Distribution  Files
Data Destination   RDBMS
Reporting Tools    Various
New Data Platform
• Platform
  – Data Distribution: Kafka
  – Data Processing: Go / Docker
  – Data Store: Cassandra
  – AWS
• Design Decisions
  – Move data when available
  – Transform when all data is available
• Cassandra
  – CAP: Emphasis on A & P with tunable C
  – Wide row tables
  – Linear scalability to handle large data sizes
  – Out of the box multi-DC deployment
                   Existing      New
Data Sources       RDBMS, Files  RDBMS, Files
ETL                File based    Go / Docker
Data Distribution  Files         Kafka
Data Destination   RDBMS         Cassandra
Reporting Tools    Various       Streamlined
Design Decisions
• Data Modeling
  – Partition Key/Size
  – “Read” Response time
  – Handling Consistency
  – Collection Columns: Sets & Maps
  – Logical separation of raw & processed data
  – All lookup data in a single table
• Indexes
  – Primary, Inverted, Secondary, DSE Search Indexes
• DSE Search
  – Range queries
  – Regular-expression queries
  – Non-equality queries
  – Faceted search
Design Decisions
• Consistency
  – Write consistency (W) + read consistency (R) > replication factor (RF), so every read overlaps at least one replica that acknowledged the write
• Indexes
Data Access Option            Data / Index  Storage  Response Time    Maintained By  Cardinality  Search     Consistency
Primary Key                   Data          High     Very fast        App            High         Limited    Tunable
Duplicate Data (Primary Key)  Data          High     Very fast        App            High         Limited    Tunable
Inverted Index                Index         Low      Fast             App            High         Limited    Tunable
Secondary Index               Index         Low      Medium           System         Low          Limited    Tunable
DSE Search                    Index         Medium   Slow (relative)  System         Any          Versatile  One (R)
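
The "Tunable" entries in the Consistency column tie back to the rule on the Design Decisions slide: W + R > RF. The following is a minimal sketch of that arithmetic only, not the presenters' code. The function names `quorum` and `overlaps` are made up for illustration, and the levels are expressed as replica counts rather than Cassandra consistency-level names.

```go
package main

import "fmt"

// quorum returns the number of replicas that form a majority
// for the given replication factor.
func quorum(rf int) int {
	return rf/2 + 1
}

// overlaps reports whether W + R > RF holds, i.e. whether every read is
// guaranteed to touch at least one replica that acknowledged the write.
func overlaps(w, r, rf int) bool {
	return w+r > rf
}

func main() {
	const rf = 3 // replication factor used for the benchmarks in this deck
	q := quorum(rf)
	fmt.Println("quorum size at RF=3:", q)
	fmt.Println("quorum write + quorum read overlaps:", overlaps(q, q, rf))
	fmt.Println("one write + one read overlaps:", overlaps(1, 1, rf))
	fmt.Println("all write + one read overlaps:", overlaps(rf, 1, rf))
}
```

With RF = 3, a quorum is 2 replicas, so quorum writes plus quorum reads satisfy the rule (2 + 2 > 3), while writing and reading at a single replica does not (1 + 1 is not greater than 3).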
Benchmarking: Indexes
[Bar chart: time in seconds by index type; values are the chart's data labels in category order]
Index Type       Time (seconds)
Primary Index    0.02
Inverted Index   0.04
DSE Search       1.6
Secondary Index  0.6
• 22.6 million rows
• 6 node cluster
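
To make the "Inverted Index" option concrete: it is a second lookup structure that the application writes alongside the primary data, which is why the comparison table lists it as maintained by the app. The sketch below is an in-memory analogy using Go maps, not Cassandra code. The `Account` type, its fields, and the `Store` methods are hypothetical, since the deck does not show a schema.

```go
package main

import "fmt"

// Account is a hypothetical record used only for illustration.
type Account struct {
	ID     string
	Region string
}

// Store mimics two tables: one keyed by primary key, and one
// application-maintained inverted index keyed by an attribute value.
type Store struct {
	byID     map[string]Account
	byRegion map[string][]string // region -> account IDs
}

// NewStore returns an empty store with both lookup structures initialized.
func NewStore() *Store {
	return &Store{byID: map[string]Account{}, byRegion: map[string][]string{}}
}

// Put writes the record and its index entry together. Keeping the two in
// step is the application's job. This sketch ignores updates and deletes,
// which would also have to clean up stale index entries.
func (s *Store) Put(a Account) {
	s.byID[a.ID] = a
	s.byRegion[a.Region] = append(s.byRegion[a.Region], a.ID)
}

// FindByRegion does one lookup in the index, then one primary-key lookup
// per matching ID.
func (s *Store) FindByRegion(region string) []Account {
	var out []Account
	for _, id := range s.byRegion[region] {
		out = append(out, s.byID[id])
	}
	return out
}

func main() {
	s := NewStore()
	s.Put(Account{ID: "a1", Region: "east"})
	s.Put(Account{ID: "a2", Region: "west"})
	s.Put(Account{ID: "a3", Region: "east"})
	fmt.Println(s.FindByRegion("east"))
}
```

The trade-off matches the table: reads stay close to primary-key speed, at the cost of extra writes and index upkeep in application code.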
Performance
• Replication factor: 3
• Write heavy
  – Increased concurrent writes to 64 (from 32)
  – Decreased concurrent reads to 16 (from 32)
  – Size-tiered compaction strategy
• Cassandra cluster with DSE Search enabled on all nodes
• Virtual nodes set to 16
• All caches disabled except filter cache
• EC2 Snitch on AWS – 3 AZs
• DSE Search soft auto-commit max time set to 10s
Solution Stack
[Diagram components: Mesosphere, Marathon, Docker (Go Services: Ingestion), Docker (Go Services: Processor), Kafka, API Server, Consul]
Solution Stack: Plug-In Framework
• Go Service: Plugins chained in a single process
• Packaged & deployed in a Docker Container
• Bootstrapped from a config
• 100% developed in-house
[Diagram: a Go Service containing three runners; each runner wraps a plugin with an in-channel and an out-channel]
Solution Stack
• Cassandra: Data storage
• Go-Based Plugin Framework: Go services for data ingestion & processing
• Docker: Packaging and deployment
• Mesosphere: Single view of infrastructure
• Marathon: Launch containers
• Kafka: Data transfer and distribution
• Consul: Service discovery and configuration management
• Jenkins: Continuous integration
Benchmarking: Data Processing
• Test for a functional group
• Cassandra: 6 node cluster
• Kafka: 6 node cluster
• Go Services: 3
• Primary Data Source: Oracle
• Time: 360 minutes
• Data Size: 1 year

Description                       Measure
Total rows processed              450 million
De-normalized rows                11.8 million
Rate of processing (Go Services)  ~300k tps
Rate of processing (Platform)     ~21k tps
% time waiting on data ingestion  75%
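
The platform rate can be cross-checked from the other figures on this slide. The short calculation below assumes that "tps" means rows per second over the full 360-minute run; the deck does not define the unit, and the `throughput` helper is a name chosen for this sketch.

```go
package main

import "fmt"

// throughput returns rows per second for a run lasting the given minutes.
func throughput(rows, minutes float64) float64 {
	return rows / (minutes * 60)
}

func main() {
	const (
		totalRows  = 450e6  // total rows processed
		denormRows = 11.8e6 // de-normalized rows
		runMinutes = 360.0  // test duration
	)
	fmt.Printf("platform rate: %.0f rows/s\n", throughput(totalRows, runMinutes))
	fmt.Printf("input rows per de-normalized row: %.1f\n", totalRows/denormRows)
}
```

450 million rows over 21,600 seconds comes to roughly 20,800 rows per second, which agrees with the reported ~21k tps for the platform as a whole.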
Challenges
• Not all query patterns are known in advance
• Index rebuilds are costly
• Business adjusting to near real-time data
• Operational support adjustments
• Backup/Restore
• Finding Talent – We are hiring!
Thank you