Silicon Valley Data Science: From Oracle to Cassandra with Spark
TRANSCRIPT
@TheAllantGroup | @SVDataScience 2 © 2015. ALL RIGHTS RESERVED.
WHO ARE WE?
Shambho Krishnasamy Fausto Inestroza
CUSTOMER RECOGNITION
Challenges in the digital age:
– Scalability
– Throughput
– Cost
CUSTOMER RECOGNITION
Functional Buckets:
– Key Management
– Tailoring
– Key Assignment
– Hygiene
LEGACY APPLICATION
[Architecture diagram: a Recognition Bus Service sits between Input and Output, with keying and lookup services for Party, Address, Household (HHLD), Individual (Indv), and Digital keys, backed by Address, Household, Individual, DigitalKey, Digi-Asso, and Reference data stores.]
NEED FOR CHANGE? RE-PLATFORM! RE-ARCHITECT!
LIMITATIONS TO SCALE – MESSAGE PROCESSING ARCHITECTURE
• Message processing engine
• Common API to handle real-time and batch
• Batch is converted into messages
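The batch-to-message idea above can be sketched as follows; `batch_to_messages` and the chunk size are illustrative names, not the actual engine's API:

```python
from typing import Iterable, Iterator, List


def batch_to_messages(records: Iterable[dict], chunk_size: int = 100) -> Iterator[List[dict]]:
    """Split a batch of records into message-sized chunks so one
    message-processing engine can serve both real-time and batch input."""
    chunk: List[dict] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final partial chunk
        yield chunk


# A real-time request is just a batch of size one going through the same path.
batch = [{"id": i} for i in range(250)]
messages = list(batch_to_messages(batch, chunk_size=100))
print(len(messages), len(messages[-1]))  # 3 chunks; the last holds 50 records
```

The common API is the point: both paths converge on the same per-message handler, which is also why the per-message design becomes the bottleneck at batch volumes.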
LIMITATIONS TO SCALE – DATA THROUGHPUT
4-8 MM records/hour
Volume vs. Performance
Scale to meet Allant’s Audience Interconnect® customer recognition needs
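A quick back-of-the-envelope calculation shows why 4-8 MM records/hour is a ceiling; the 400 million profile figure is taken from the benchmark slides later in this deck:

```python
def hours_needed(profiles: int, records_per_hour: int) -> float:
    """Wall-clock hours to key a file at a given engine throughput."""
    return profiles / records_per_hour


profiles = 400_000_000  # larger benchmark input, from the results slides

for rate in (4_000_000, 8_000_000):  # the legacy 4-8 MM records/hour range
    h = hours_needed(profiles, rate)
    print(f"{rate // 1_000_000} MM/hr -> {h:.0f} hours (~{h / 24:.1f} days)")
# 4 MM/hr -> 100 hours (~4.2 days); 8 MM/hr -> 50 hours (~2.1 days)
```

Even at the top of the range, a large input is a multi-day run, before contention slows the real rate further.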
LIMITATIONS TO SCALE – SCALING HORIZONTALLY
Locking!
LIMITATIONS TO SCALE – SCALING VERTICALLY
WHAT DO WE WANT?
– Increase throughput (but don’t compromise on real-time API capability!)
– Improve scalability (but contain cost!)
– Elastic infrastructure (well… so we went Cloud)
SWITCH DATA STORE
Consistent Reads! Consistent Writes!
JMS
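In Cassandra, "consistent reads, consistent writes" is a tunable property rather than a given: with replication factor N, choosing read and write consistency levels so that R + W > N guarantees every read quorum overlaps the latest write quorum. A minimal sketch of that rule (illustrative helper, not driver code):

```python
def is_strongly_consistent(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """R + W > N: every read set intersects every acknowledged write set,
    so a read always sees the most recent acknowledged write."""
    return read_replicas + write_replicas > replication_factor


# QUORUM writes + QUORUM reads on RF=3: 2 + 2 > 3 -> strong consistency.
print(is_strongly_consistent(3, 2, 2))   # True
# ONE write + ONE read on RF=3: 1 + 1 <= 3 -> eventual consistency only.
print(is_strongly_consistent(3, 1, 1))   # False
```

This is the knob that lets a recognition service keep consistent keying while the store itself scales horizontally.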
BUT…APPLICATION LAYER IS STILL A BOTTLENECK
RECAP - LEGACY APPLICATION
[Architecture diagram repeated: Recognition Bus Service between Input and Output, with Party, Address, HHLD, Indv, and Digital keying and lookup services over Address, Household, Individual, DigitalKey, Digi-Asso, and Reference stores.]
INTRODUCED CASSANDRA…
JMS
But Cassandra is very bored…
HOW TO RE-ARCHITECT?
What about this part?
JMS
Cassandra is very bored…
NOW INTRODUCE HADOOP
We employed distributed data management technology end-to-end…
Cassandra is very happy!
PERFORMANCE BENCHMARK RESULTS - ENVIRONMENT
• 12 Cassandra Nodes – 4 CPU, 15 GB RAM, 80 GB SSD
• 6 Hadoop Nodes – 32 CPU, 60 GB RAM, 640 GB SSD
PERFORMANCE BENCHMARK RESULTS - MAPREDUCE

Benchmark 1: Smaller Input (~15 Million Profiles)
  JMS – Oracle:           4.5 Million / Hour
  MapReduce – Cassandra:  44 Million / Hour    (~10x)

Benchmark 2: Larger Input (~400 Million Profiles)
  JMS – Oracle:           2.5 Million / Hour
  MapReduce – Cassandra:  45 Million / Hour    (~20x)

From 6-7 days down to ~8 hours!
INTRODUCE SPARK
Cassandra is ecstatic!
EMPLOY DATASTAX’S LIGHTNING-FAST SPARK CASSANDRA CONNECTOR
PERFORMANCE BENCHMARK RESULTS - SPARK

Benchmark 2: Larger Input (~400 Million Profiles)
  JMS – Oracle:           2.5 Million / Hour
  MapReduce – Cassandra:  45 Million / Hour
  Spark – Cassandra:      125 Million / Hour [185 Million / Hour for “match only”]    (~50x)

From 6-7 days down to ~3 hours!
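The headline figures are easy to sanity-check from the per-hour rates on the benchmark slides:

```python
profiles = 400_000_000  # larger benchmark input

rates = {  # records keyed per hour, from the benchmark slides
    "JMS - Oracle": 2_500_000,
    "MapReduce - Cassandra": 45_000_000,
    "Spark - Cassandra": 125_000_000,
}

baseline = rates["JMS - Oracle"]
for env, rate in rates.items():
    hours = profiles / rate
    print(f"{env}: {hours:.1f} hours ({rate / baseline:.0f}x)")
# JMS: 160 hours, i.e. the quoted 6-7 days; MapReduce: ~8.9 hours
# (~18x, which the slide rounds to ~20x); Spark: 3.2 hours (50x).
```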
TAKEAWAYS
• We did contain cost! – with better throughput & scalability
• Putting Cassandra to work by employing MapReduce and Spark
• Unimpeded throughput regardless of the data-store volume
• Unique key generation under distributed data technology
• Resolving the traditional latency vs. throughput conflict
• In our use-case, the data-store:
  – Is encapsulated!
  – Has only controlled access!
  – Does only reads and writes!
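The "unique key generation" takeaway is worth unpacking: on a cluster there is no central Oracle sequence to hand out IDs, so keys must be assignable without cross-node coordination. One common approach (a sketch under that assumption, not Allant's actual scheme) derives a deterministic key from normalized identifying attributes, so every node independently assigns the same key to the same profile:

```python
import hashlib
import uuid


def profile_key(name: str, address: str) -> str:
    """Deterministic key from normalized identifying attributes: any node
    hashing the same (name, address) pair produces the same key, with no
    central sequence and no cross-node locking."""
    normalized = f"{name.strip().lower()}|{address.strip().lower()}"
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    # Present 128 bits of the digest as a UUID-shaped key.
    return str(uuid.UUID(bytes=digest[:16]))


# Identical profiles keyed on different nodes resolve to the same key,
# which is what removes the locking that blocked horizontal scaling.
k1 = profile_key("Jane Doe", "12 Main St")
k2 = profile_key("  jane doe ", "12 MAIN ST")
assert k1 == k2
```

Hash-derived keys trade a central authority for idempotence: re-keying the same input is harmless, which also suits MapReduce and Spark retries.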