Silicon Valley Data Science: From Oracle to Cassandra with Spark
TRANSCRIPT
@TheAllantGroup | @SVDataScience 2 © 2015. ALL RIGHTS RESERVED.
WHO ARE WE?
Shambho Krishnasamy Fausto Inestroza
CUSTOMER RECOGNITION
Challenges in the digital age:
– Scalability
– Throughput
– Cost
CUSTOMER RECOGNITION
Functional Buckets:
– Key Management
– Tailoring
– Key Assignment
– Hygiene
LEGACY APPLICATION
[Architecture diagram: a Recognition Bus Service sits between Input and Output, with keying and lookup services for Party, Address, Household (HHLD), Individual (Indv), and Digital keys, backed by Address, Household, Individual, DigitalKey, Digi-Asso, and Reference data stores.]
NEED FOR CHANGE? RE-PLATFORM! RE-ARCHITECT!
LIMITATIONS TO SCALE – MESSAGE PROCESSING ARCHITECTURE
• Message processing engine
• Common API to handle real-time and batch
• Batch is converted into messages
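The batch-to-message idea above can be sketched as follows; `batch_to_messages` and the chunk size are illustrative names, not the actual engine's API:

```python
from typing import Iterable, Iterator, List


def batch_to_messages(records: Iterable[dict], chunk_size: int = 100) -> Iterator[List[dict]]:
    """Split a batch of records into message-sized chunks so one
    message-processing engine can serve both real-time and batch input."""
    chunk: List[dict] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final partial chunk
        yield chunk


# A real-time request is just a batch of size one going through the same path.
batch = [{"id": i} for i in range(250)]
messages = list(batch_to_messages(batch, chunk_size=100))
print(len(messages), len(messages[-1]))  # 3 chunks; the last holds 50 records
```

The common API is the point: both paths converge on the same per-message handler, which is also why the per-message design becomes the bottleneck at batch volumes.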
LIMITATIONS TO SCALE – DATA THROUGHPUT
4-8 MM records/hour
Volume vs. Performance
Scale to meet Allant’s Audience Interconnect® customer recognition needs
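A quick back-of-the-envelope calculation shows why 4-8 MM records/hour is a ceiling; the 400 million profile figure is taken from the benchmark slides later in this deck:

```python
def hours_needed(profiles: int, records_per_hour: int) -> float:
    """Wall-clock hours to key a file at a given engine throughput."""
    return profiles / records_per_hour


profiles = 400_000_000  # larger benchmark input, from the results slides

for rate in (4_000_000, 8_000_000):  # the legacy 4-8 MM records/hour range
    h = hours_needed(profiles, rate)
    print(f"{rate // 1_000_000} MM/hr -> {h:.0f} hours (~{h / 24:.1f} days)")
# 4 MM/hr -> 100 hours (~4.2 days); 8 MM/hr -> 50 hours (~2.1 days)
```

Even at the top of the range, a large input is a multi-day run, before contention slows the real rate further.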
LIMITATIONS TO SCALE – SCALING HORIZONTALLY
Locking!
LIMITATIONS TO SCALE – SCALING VERTICALLY
WHAT DO WE WANT?
– Increase throughput (but don’t compromise on real-time API capability!)
– Improve scalability (but contain cost!)
– Elastic infrastructure (well… so we went Cloud)
SWITCH DATA STORE
Consistent Reads! Consistent Writes!
JMS
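In Cassandra, "consistent reads, consistent writes" is a tunable property rather than a given: with replication factor N, choosing read and write consistency levels so that R + W > N guarantees every read quorum overlaps the latest write quorum. A minimal sketch of that rule (illustrative helper, not driver code):

```python
def is_strongly_consistent(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """R + W > N: every read set intersects every acknowledged write set,
    so a read always sees the most recent acknowledged write."""
    return read_replicas + write_replicas > replication_factor


# QUORUM writes + QUORUM reads on RF=3: 2 + 2 > 3 -> strong consistency.
print(is_strongly_consistent(3, 2, 2))   # True
# ONE write + ONE read on RF=3: 1 + 1 <= 3 -> eventual consistency only.
print(is_strongly_consistent(3, 1, 1))   # False
```

This is the knob that lets a recognition service keep consistent keying while the store itself scales horizontally.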
BUT…APPLICATION LAYER IS STILL A BOTTLENECK
RECAP - LEGACY APPLICATION
[Architecture diagram repeated: Recognition Bus Service between Input and Output, with Party, Address, HHLD, Indv, and Digital keying and lookup services over Address, Household, Individual, DigitalKey, Digi-Asso, and Reference stores.]
INTRODUCED CASSANDRA…
JMS
But Cassandra is very bored…
HOW TO RE-ARCHITECT?
What about this part?
JMS
Cassandra is very bored…
NOW INTRODUCE HADOOP
We employed distributed data management technology end-to-end…
Cassandra is very happy!
PERFORMANCE BENCHMARK RESULTS - ENVIRONMENT
• 12 Cassandra Nodes – 4 CPU, 15 GB RAM, 80 GB SSD
• 6 Hadoop Nodes – 32 CPU, 60 GB RAM, 640 GB SSD
PERFORMANCE BENCHMARK RESULTS - MAPREDUCE

Benchmark 1: Smaller Input (~15 Million Profiles)
  JMS – Oracle:           4.5 Million / Hour
  MapReduce – Cassandra:  44 Million / Hour    (~10x)

Benchmark 2: Larger Input (~400 Million Profiles)
  JMS – Oracle:           2.5 Million / Hour
  MapReduce – Cassandra:  45 Million / Hour    (~20x)

From 6-7 days down to ~8 hours!
INTRODUCE SPARK
Cassandra is ecstatic!
EMPLOY DATASTAX’S LIGHTNING-FAST SPARK CASSANDRA CONNECTOR
PERFORMANCE BENCHMARK RESULTS - SPARK

Benchmark 2: Larger Input (~400 Million Profiles)
  JMS – Oracle:           2.5 Million / Hour
  MapReduce – Cassandra:  45 Million / Hour
  Spark – Cassandra:      125 Million / Hour [185 Million / Hour for “match only”]    (~50x)

From 6-7 days down to ~3 hours!
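The headline figures are easy to sanity-check from the per-hour rates on the benchmark slides:

```python
profiles = 400_000_000  # larger benchmark input

rates = {  # records keyed per hour, from the benchmark slides
    "JMS - Oracle": 2_500_000,
    "MapReduce - Cassandra": 45_000_000,
    "Spark - Cassandra": 125_000_000,
}

baseline = rates["JMS - Oracle"]
for env, rate in rates.items():
    hours = profiles / rate
    print(f"{env}: {hours:.1f} hours ({rate / baseline:.0f}x)")
# JMS: 160 hours, i.e. the quoted 6-7 days; MapReduce: ~8.9 hours
# (~18x, which the slide rounds to ~20x); Spark: 3.2 hours (50x).
```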
TAKEAWAYS
• We did contain cost! – with better throughput & scalability
• Putting Cassandra to work by employing MapReduce and Spark
• Unimpeded throughput regardless of the data-store volume
• Unique key generation under distributed data technology
• Resolving the traditional latency vs. throughput conflict
• In our use-case, the data-store:
  – Is encapsulated!
  – Has only controlled access!
  – Does only reads and writes!
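The "unique key generation" takeaway is worth unpacking: on a cluster there is no central Oracle sequence to hand out IDs, so keys must be assignable without cross-node coordination. One common approach (a sketch under that assumption, not Allant's actual scheme) derives a deterministic key from normalized identifying attributes, so every node independently assigns the same key to the same profile:

```python
import hashlib
import uuid


def profile_key(name: str, address: str) -> str:
    """Deterministic key from normalized identifying attributes: any node
    hashing the same (name, address) pair produces the same key, with no
    central sequence and no cross-node locking."""
    normalized = f"{name.strip().lower()}|{address.strip().lower()}"
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    # Present 128 bits of the digest as a UUID-shaped key.
    return str(uuid.UUID(bytes=digest[:16]))


# Identical profiles keyed on different nodes resolve to the same key,
# which is what removes the locking that blocked horizontal scaling.
k1 = profile_key("Jane Doe", "12 Main St")
k2 = profile_key("  jane doe ", "12 MAIN ST")
assert k1 == k2
```

Hash-derived keys trade a central authority for idempotence: re-keying the same input is harmless, which also suits MapReduce and Spark retries.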