Capital One: Using Cassandra In Building A Reporting Platform

Using Cassandra In Building A Reporting Platform
Javed Roshan – Director, Data Services
Mukaram Aziz – Sr. Manager, Data Services

Upload: datastax-academy

Post on 13-Feb-2017


TRANSCRIPT


1 Use Case

2 New Data Platform

3 Design Decisions

4 Solution Stack

5 Challenges

2 © 2015. All Rights Reserved.


Use Case

•  Fast Data requirements in an Operational Space
   –  Metrics and Reports for intra-day business decisions
   –  Process Monitoring

•  Current Landscape
   –  Multiple data sources
   –  Traditional batched ETL
   –  Multiple data destinations
   –  Reporting Tools

•  Opportunity Areas
   –  Make reports near real time
   –  Achieve 99.99% SLAs
   –  Time to market delivery
   –  Make enhancements inexpensive


| Layer | Existing |
|---|---|
| Data Sources | RDBMS, Files |
| ETL | File based |
| Data Distribution | Files |
| Data Destination | RDBMS |
| Reporting Tools | Various |


New Data Platform

•  Platform
   –  Data Distribution: Kafka
   –  Data Processing: Go / Docker
   –  Data Store: Cassandra
   –  AWS

•  Design Decisions
   –  Move data when available
   –  Transform when all data available

•  Cassandra
   –  CAP: Emphasis on A & P with tunable C
   –  Wide row tables
   –  Linear scalability to handle large data sizes
   –  Out of the box multi-DC deployment
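One way to read the two design decisions on this slide ("move data when available" and "transform when all data available") is as a readiness gate: each source's data is forwarded as soon as it lands, and a transform fires only once every source it depends on has reported in. The Go sketch below illustrates that gate under stated assumptions. The deck does not show its actual mechanism, and `ReadinessTracker`, the batch key, and the source names are all hypothetical.

```go
package main

import "fmt"

// ReadinessTracker records which upstream sources have delivered data for a
// batch. It is a hypothetical illustration, not the platform's own code.
type ReadinessTracker struct {
	required map[string]bool            // sources a transform depends on
	arrived  map[string]map[string]bool // batch key -> sources seen so far
}

// NewReadinessTracker builds a tracker that waits for the given sources.
func NewReadinessTracker(sources ...string) *ReadinessTracker {
	req := make(map[string]bool, len(sources))
	for _, s := range sources {
		req[s] = true
	}
	return &ReadinessTracker{required: req, arrived: make(map[string]map[string]bool)}
}

// Arrive marks a source as delivered for a batch and reports whether every
// required source has now been seen for that batch. Sources that are not
// required are ignored.
func (t *ReadinessTracker) Arrive(batch, source string) bool {
	if t.required[source] {
		if t.arrived[batch] == nil {
			t.arrived[batch] = make(map[string]bool)
		}
		t.arrived[batch][source] = true
	}
	return len(t.arrived[batch]) == len(t.required)
}

func main() {
	tr := NewReadinessTracker("accounts", "transactions")
	fmt.Println(tr.Arrive("2015-06-01", "accounts"))     // false: still waiting
	fmt.Println(tr.Arrive("2015-06-01", "transactions")) // true: transform may run
}
```

In a real pipeline the arrival state would have to survive restarts, so it would live in a durable store rather than in process memory.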


| Layer | Existing | New |
|---|---|---|
| Data Sources | RDBMS, Files | RDBMS, Files |
| ETL | File based | Go / Docker |
| Data Distribution | Files | Kafka |
| Data Destination | RDBMS | Cassandra |
| Reporting Tools | Various | Streamlined |


Design Decisions

•  Data Modeling
   –  Partition Key/Size
   –  “Read” Response time
   –  Handling Consistency
   –  Collection Columns: Sets & Maps
   –  Logical separation of raw & processed data
   –  All lookup data in a single table

•  Indexes
   –  Primary, Inverted, Secondary, DSE Search Indexes

•  DSE Search
   –  range-queries
   –  regular-expression
   –  non-equality
   –  faceted
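Of the index types listed on this slide, the inverted index is the one the application has to keep up to date itself. The Go sketch below shows the idea with two in-memory maps standing in for a data table and an index table. It is an illustration only: the `Store` type, the `status` attribute, and the row IDs are hypothetical, and no Cassandra driver is involved.

```go
package main

import (
	"fmt"
	"sort"
)

// Store holds a primary "table" keyed by ID and an application-maintained
// inverted index from an attribute value back to the IDs that carry it.
// Both maps are in-memory stand-ins for tables (an assumption made here).
type Store struct {
	byID     map[string]string          // id -> status
	byStatus map[string]map[string]bool // status -> set of ids
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{byID: map[string]string{}, byStatus: map[string]map[string]bool{}}
}

// Put writes the row and keeps the inverted index in step: any stale index
// entry for the previous status is removed before the new one is added.
func (s *Store) Put(id, status string) {
	if old, ok := s.byID[id]; ok {
		delete(s.byStatus[old], id)
		if len(s.byStatus[old]) == 0 {
			delete(s.byStatus, old)
		}
	}
	s.byID[id] = status
	if s.byStatus[status] == nil {
		s.byStatus[status] = make(map[string]bool)
	}
	s.byStatus[status][id] = true
}

// IDsWithStatus answers the lookup from the index alone, in sorted order.
func (s *Store) IDsWithStatus(status string) []string {
	ids := make([]string, 0, len(s.byStatus[status]))
	for id := range s.byStatus[status] {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	s := NewStore()
	s.Put("r1", "open")
	s.Put("r2", "open")
	s.Put("r1", "closed") // moves r1 from the "open" entry to "closed"
	fmt.Println(s.IDsWithStatus("open"))   // [r2]
	fmt.Println(s.IDsWithStatus("closed")) // [r1]
}
```

The cost of this approach is that every write touches two structures, and the application is responsible for keeping them consistent.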


Design Decisions

•  Consistency
   –  W Consistency + R Consistency > Replication Factor

•  Indexes


| Data Access Option | Data / Index | Storage | Response Time | Maintained By | Cardinality | Search | Consistency |
|---|---|---|---|---|---|---|---|
| Primary Key | Data | High | Very fast | App | High | Limited | Tunable |
| Duplicate Data (Primary Key) | Data | High | Very fast | App | High | Limited | Tunable |
| Inverted Index | Index | Low | Fast | App | High | Limited | Tunable |
| Secondary Index | Index | Low | Medium | System | Low | Limited | Tunable |
| DSE Search | Index | Medium | Slow (relative) | System | Any | Versatile | One (R) |
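The consistency rule on this slide, W + R > RF, says that the replicas acknowledging a write and the replicas answering a read must overlap in at least one node. A minimal Go sketch of that check follows. It models only the ONE, QUORUM, and ALL levels, with QUORUM taken as floor(RF/2) + 1; the function names are hypothetical, and the replication factor of 3 is the value quoted on the Performance slide.

```go
package main

import "fmt"

// replicas returns how many replicas must respond at a given consistency
// level for replication factor rf. Only ONE, QUORUM and ALL are modelled.
func replicas(level string, rf int) (int, error) {
	switch level {
	case "ONE":
		return 1, nil
	case "QUORUM":
		return rf/2 + 1, nil
	case "ALL":
		return rf, nil
	}
	return 0, fmt.Errorf("unsupported consistency level %q", level)
}

// overlaps reports whether W + R > RF for the given write and read levels,
// that is, whether every read is guaranteed to reach at least one replica
// that acknowledged the latest write.
func overlaps(write, read string, rf int) (bool, error) {
	w, err := replicas(write, rf)
	if err != nil {
		return false, err
	}
	r, err := replicas(read, rf)
	if err != nil {
		return false, err
	}
	return w+r > rf, nil
}

func main() {
	const rf = 3
	for _, c := range [][2]string{{"QUORUM", "QUORUM"}, {"ONE", "ONE"}, {"ALL", "ONE"}} {
		ok, err := overlaps(c[0], c[1], rf)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("W=%s R=%s RF=%d overlap=%v\n", c[0], c[1], rf, ok)
	}
	// With RF=3: QUORUM/QUORUM gives 2+2 > 3 (true), ONE/ONE gives 1+1 > 3
	// (false), ALL/ONE gives 3+1 > 3 (true).
}
```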


Benchmarking: Indexes

[Bar chart: response time in seconds by index type]

| Index Type | Time (seconds) |
|---|---|
| Primary Index | 0.02 |
| Inverted Index | 0.04 |
| DSE Search | 1.6 |
| Secondary Index | 0.6 |

•  22.6 million rows
•  6 node cluster


Performance

•  Replication factor of 3
•  Write Heavy
   –  Increased concurrent writes to 64 (from 32)
   –  Decreased concurrent reads to 16 (from 32)
   –  Size-tiered compaction strategy
•  Cassandra cluster with DSE Search enabled on all nodes
•  Virtual nodes set to 16
•  All caches disabled except filter cache
•  EC2 Snitch on AWS across 3 AZs
•  DSE Search soft auto-commit max time set to 10s


Solution Stack

[Architecture diagram. Components shown: Mesosphere, Marathon, Docker (Go services: ingestion), Docker (Go services: processor), Kafka, API server, Consul]


Solution Stack: Plug-In Framework


•  Go Service: Plugins chained in a single process
•  Packaged & deployed in a Docker Container
•  Bootstrapped from a config
•  100% developed in-house

[Diagram: a Go service containing three runners; each runner wraps a plugin with an in channel and an out channel]
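The runner / plugin / channel layout on this slide maps naturally onto Go channels and goroutines. The sketch below is one possible shape for such a chain, not the in-house framework itself, whose code the deck does not show. The `Plugin` interface, `PluginFunc` adapter, and `Chain` function are hypothetical names, and messages are plain strings for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// Plugin is one processing stage: it reads from in until in is closed and
// writes its results to out.
type Plugin interface {
	Run(in <-chan string, out chan<- string)
}

// PluginFunc adapts a per-message function into a Plugin. The function
// returns the transformed message and whether to pass it downstream.
type PluginFunc func(msg string) (string, bool)

// Run applies the function to every incoming message.
func (f PluginFunc) Run(in <-chan string, out chan<- string) {
	for msg := range in {
		if res, keep := f(msg); keep {
			out <- res
		}
	}
}

// Chain starts one runner goroutine per plugin, wiring each plugin's out
// channel to the next plugin's in channel, and returns the last out channel.
// Each runner closes its out channel when its plugin returns, so shutdown
// propagates down the chain once the source is closed.
func Chain(source <-chan string, plugins ...Plugin) <-chan string {
	in := source
	for _, p := range plugins {
		out := make(chan string)
		go func(p Plugin, in <-chan string, out chan<- string) {
			defer close(out)
			p.Run(in, out)
		}(p, in, out)
		in = out
	}
	return in
}

func main() {
	source := make(chan string)
	go func() {
		defer close(source)
		for _, m := range []string{" a ", "", "b"} {
			source <- m
		}
	}()

	trim := PluginFunc(func(m string) (string, bool) { return strings.TrimSpace(m), true })
	dropEmpty := PluginFunc(func(m string) (string, bool) { return m, m != "" })
	upper := PluginFunc(func(m string) (string, bool) { return strings.ToUpper(m), true })

	for m := range Chain(source, trim, dropEmpty, upper) {
		fmt.Println(m) // prints A, then B
	}
}
```

Because the plugins share one process, handing a message to the next stage is a channel send rather than a network hop. The list of plugins passed to `Chain` is the kind of thing that could be read from a config at startup.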


Solution Stack

•  Cassandra
   –  Data Storage
•  Go-Based Plugin Framework
   –  Go services for data ingestion & processing
•  Docker
   –  Packaging and deployment
•  Mesosphere
   –  Single view of infrastructure
•  Marathon
   –  Launch containers
•  Kafka
   –  Data transfer and distribution
•  Consul
   –  Service discovery and configuration management
•  Jenkins
   –  Continuous Integration


Benchmarking: Data Processing


•  Test for a functional group
•  Cassandra: 6 node cluster
•  Kafka: 6 node cluster
•  Go Services: 3
•  Primary Data Source: Oracle
•  Time: 360 minutes
•  Data Size: 1 year

| Description | Measure |
|---|---|
| Total rows processed | 450 million |
| De-normalized rows | 11.8 million |
| Rate of processing (Go Services) | ~300k tps |
| Rate of processing (Platform) | ~21k tps |
| % time waiting on data ingestion | 75% |
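The platform-level rate in the table can be cross-checked against the other figures on the slide. Assuming one processed row counts as one transaction (an assumption; the deck does not define "tps"), 450 million rows over 360 minutes works out to roughly 20.8k rows per second, in line with the quoted ~21k tps. The helper name `ratePerSecond` below is hypothetical.

```go
package main

import "fmt"

// ratePerSecond converts a row count and an elapsed time in minutes into an
// average number of rows per second.
func ratePerSecond(rows, minutes float64) float64 {
	return rows / (minutes * 60)
}

func main() {
	// 450 million rows in 360 minutes, as quoted on the slide.
	platform := ratePerSecond(450e6, 360)
	fmt.Printf("platform rate: %.0f rows/s\n", platform) // platform rate: 20833 rows/s
}
```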


Challenges

•  Not all query patterns are known in advance
•  Index rebuilds are costly
•  Business adjusting to near real-time data
•  Operational support adjustments
•  Backup/Restore
•  Finding Talent – We are hiring!


Thank you