Blazing Fast Analytics with MongoDB & Spark

Muthu Chinnasamy

Senior Solutions Architect
muthu@mongodb.com
Twitter: @MuthuMongo

Agenda

• The data challenge
• Spark
• Use Cases
• Connectors
• Demo

Eric Schmidt

Every two days now we create as much information as we did from the dawn of civilization up until 2003

Apache Spark is the Taylor Swift of big data software.

Derrick Harris, Fortune

What is Spark?

Fast and general computing engine for clusters

• Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, graph
• It's fundamentally different to what's come before

Why not just use Hadoop?

• Spark is FAST
  – Faster to write.
  – Faster to run.
• Up to 100x faster than Hadoop in memory
• 10x faster on disk

A visual comparison

[Figure: Hadoop vs. Spark]

RDD Operations

Transformations    Actions
map                reduce
filter             collect
flatMap            count
mapPartitions      save
sample             lookupKey
union              take
join               foreach
groupByKey
reduceByKey
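The split between transformations and actions can be illustrated with a plain-Python analogy. This is only a sketch and does not use the Spark API: lazy iterators stand in for transformations (nothing runs until a result is requested), and `list`, `reduce`, and a count stand in for actions. The sample lines and the `build_pipeline` helper are made up for illustration.

```python
from functools import reduce
from itertools import chain

# Hypothetical input data, standing in for an RDD of text lines.
lines = ["spark is fast", "mongodb stores documents", "spark reads mongodb"]

def build_pipeline(source):
    """Chain lazy steps that mirror flatMap -> filter -> map. Nothing is computed here."""
    words = chain.from_iterable(line.split() for line in source)  # like flatMap
    long_words = filter(lambda w: len(w) > 4, words)              # like filter
    pairs = map(lambda w: (w, 1), long_words)                     # like map
    return pairs

# "Actions" force evaluation of the lazy pipeline.
collected = list(build_pipeline(lines))                                 # like collect
total = reduce(lambda acc, pair: acc + pair[1], build_pipeline(lines), 0)  # like reduce
count = sum(1 for _ in build_pipeline(lines))                           # like count

print(collected)
print(total, count)
```

In Spark the same shape would be written as a chain of transformation calls on an RDD followed by one action; the point of the analogy is only that the chain describes work, and the action triggers it.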

Spark higher level libraries

• Spark SQL
• Spark Streaming
• MLlib
• GraphX

Spark + MongoDB

Data Management

• OLTP
• Applications
• Fine-grained operations
• Low latency

• Offline processing
• Analytics
• Data warehousing
• High throughput

Spark + MongoDB top use cases:
– Business Intelligence
– Data Warehousing
– Recommendation
– Log processing
– User Facing Services
– Fraud detection

MongoDB and Spark

Spark reading directly from MongoDB

Aggregation pipeline to Pre-filter

Aggregation pipeline filter: $match
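A `$match` pre-filter is an ordinary aggregation pipeline stage, so it can be built as plain data before being handed to the connector. The sketch below builds a one-stage pipeline and applies a tiny local stand-in for `$match` (supporting `$gte` only) to show which documents would reach Spark. The field name `rating`, the sample documents, and the helper names are assumptions for illustration; how the pipeline is passed to the connector depends on the connector version and is not shown here.

```python
import json

def build_prefilter(field, minimum):
    """Return a one-stage aggregation pipeline keeping documents where field >= minimum."""
    return [{"$match": {field: {"$gte": minimum}}}]

def matches(doc, pipeline):
    """Local stand-in for $match with $gte only, to show what the filter keeps."""
    for stage in pipeline:
        for field, condition in stage["$match"].items():
            if field not in doc or doc[field] < condition["$gte"]:
                return False
    return True

pipeline = build_prefilter("rating", 4)   # "rating" is a hypothetical field
pipeline_json = json.dumps(pipeline)      # serialized form of the same pipeline

# Hypothetical documents, standing in for a MongoDB collection.
sample_docs = [
    {"movie": "Frozen", "rating": 5},
    {"movie": "Toy Story", "rating": 3},
]
kept = [doc for doc in sample_docs if matches(doc, pipeline)]

print(pipeline_json)
print(kept)
```

Filtering inside MongoDB means only the matching documents cross the network, which is the motivation for pre-filtering before Spark reads the data.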

Spark writing directly to MongoDB

Fraud Detection

I'm so in love!

Me, too<3

Now send me your CC number

Ok, XXXX-123-zzz

Fraud Detection

Sharing Workloads

Chat App
• Login
• User Profile
• Contacts
• Messages
• …

HDFS
• Archiving
• Data Crunching

• Fraud Detection
• Segmentation
• Recommendations

MongoDB + Spark Connector

MongoDB Spark Connector
https://spark-packages.org/?q=official+mongodb
https://github.com/mongodb/mongo-spark

[Diagram: Spark, Connector, MongoDB Shard]

Spark Streaming

Twitter Feed → Spark

{"statuses": [{
    "coordinates": null,
    "favorited": false,
    "truncated": false,
    "created_at": "Mon Sep 24 03:35:21 +0000 2012",
    "id_str": "250075927172759552",
    "entities": {
      "urls": [],
      "hashtags": [{"text": "freebandnames", "indices": [20, 34]}],
      "user_mentions": []
    }
    …

Spark Streaming output (hashtag count per minute):

{"time": "Mon Sep 24 03:35", "freebandnames": 1}

Capped Collection

MongoDB and Spark Streaming feature

{"time": "Mon Nov 5 09:40", "mongoDBLondon": 400}

{"time": "Mon Nov 5 11:50", "spark": 7556}

{"time": "Mon Nov 24 12:50", "itshappening": 100}

Tailable Cursor
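The per-minute hashtag documents shown on these slides can be derived from a tweet payload roughly as follows. This is a plain-Python sketch of the counting step only, not Spark Streaming code; the function name and the trimmed sample payload are assumptions, with the payload keeping only the fields visible on the slides.

```python
import json
from collections import Counter
from datetime import datetime

def hashtag_counts_per_minute(payload):
    """Group hashtags by the minute of created_at and count occurrences."""
    counts = {}
    for status in json.loads(payload)["statuses"]:
        created = datetime.strptime(status["created_at"], "%a %b %d %H:%M:%S %z %Y")
        minute = created.strftime("%a %b %d %H:%M")  # truncate to the minute
        bucket = counts.setdefault(minute, Counter())
        for tag in status["entities"]["hashtags"]:
            bucket[tag["text"]] += 1
    # One document per minute: {"time": ..., "<hashtag>": <count>, ...}
    return [{"time": minute, **tags} for minute, tags in counts.items()]

# Trimmed sample with only the fields shown on the slides.
sample = json.dumps({
    "statuses": [{
        "created_at": "Mon Sep 24 03:35:21 +0000 2012",
        "id_str": "250075927172759552",
        "entities": {
            "urls": [],
            "hashtags": [{"text": "freebandnames", "indices": [20, 34]}],
            "user_mentions": [],
        },
    }]
})

documents = hashtag_counts_per_minute(sample)
print(documents)
```

Documents of this shape are what would be written to the capped collection and then read back through a tailable cursor.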

MongoDB + Spark MLlib Demo

Collaborative Filtering

• Two parts
  – Collaborative: using rating preferences from several users
  – Filtering: recommend preferences

UserId / MovieId Star Wars Toy Story Frozen

Buzz 4 4 5

Woody 5 4

Jessie 5 ?

Movie Ratings as a matrix

MLlib ALS

• Approximate into User & Movie latent factor matrices

[Figure: the UserId / MovieId ratings matrix approximated by a user latent factor matrix (Buzz, Woody, Jessie, each with factors x and y) and a movie latent factor matrix (Star Wars, Toy Story, Frozen)]
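The idea of approximating the ratings matrix with user and movie latent factors can be sketched in plain Python. This is a simplified illustration, not MLlib's implementation: it uses a single latent factor per user and movie (the slide shows two, x and y), and each alternating step is a closed-form one-dimensional ridge regression. Buzz's ratings follow the slide; the column placement of Woody's and Jessie's ratings, the regularization value, and the function name are assumptions.

```python
import random

def als_rank1(ratings, users, movies, reg=0.1, iterations=20, seed=0):
    """Alternating least squares with one latent factor per user and per movie.

    ratings maps (user, movie) -> rating; missing pairs are simply absent.
    """
    rng = random.Random(seed)
    u = {name: rng.uniform(0.5, 1.5) for name in users}
    m = {name: rng.uniform(0.5, 1.5) for name in movies}
    for _ in range(iterations):
        # Fix movie factors, solve each user's factor in closed form.
        for name in users:
            obs = [(m[mv], r) for (us, mv), r in ratings.items() if us == name]
            u[name] = sum(f * r for f, r in obs) / (sum(f * f for f, _ in obs) + reg)
        # Fix user factors, solve each movie's factor in closed form.
        for name in movies:
            obs = [(u[us], r) for (us, mv), r in ratings.items() if mv == name]
            m[name] = sum(f * r for f, r in obs) / (sum(f * f for f, _ in obs) + reg)
    return u, m

users = ["Buzz", "Woody", "Jessie"]
movies = ["Star Wars", "Toy Story", "Frozen"]
ratings = {
    ("Buzz", "Star Wars"): 4.0, ("Buzz", "Toy Story"): 4.0, ("Buzz", "Frozen"): 5.0,
    ("Woody", "Star Wars"): 5.0, ("Woody", "Toy Story"): 4.0,  # placement assumed
    ("Jessie", "Star Wars"): 5.0,                               # placement assumed
}

u, m = als_rank1(ratings, users, movies)
predicted = u["Jessie"] * m["Toy Story"]  # fill in a missing cell of the matrix
train_rmse = (sum((u[us] * m[mv] - r) ** 2 for (us, mv), r in ratings.items())
              / len(ratings)) ** 0.5
print(round(predicted, 2), round(train_rmse, 2))
```

A missing rating is predicted as the product of the corresponding user and movie factors; with more factors per user and movie, each alternating step becomes a small linear system instead of a single division.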

Prediction Process

• Load movie ratings data from MongoDB
• Reflect and infer the input formats for the ALS algorithm
• Split the data
  – 80% for training and 20% for validating the model
• Calculate the best model using the ALS algorithm
  – Build/train a User Movie matrix model
• Combine the data with user preferences and retrain the …
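The 80/20 split step above can be sketched in plain Python. This is an illustration of the idea only; in Spark the same step is typically done with `randomSplit` on the loaded data. The ratings here are synthetic placeholders, and the function name and seed are assumptions.

```python
import random

def split_ratings(ratings, train_fraction=0.8, seed=42):
    """Shuffle the ratings and split them into training and validation sets."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Synthetic (user_id, movie_id, rating) tuples, standing in for data loaded from MongoDB.
ratings = [(user, movie, float((user + movie) % 5 + 1))
           for user in range(5) for movie in range(2)]

train, validation = split_ratings(ratings)
print(len(train), len(validation))
```

The model is trained on the larger set, and the held-out set is used to compare candidate models before picking the best one.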

Explore as a Databricks Notebook
http://cdn2.hubspot.net/hubfs/438089/notebooks/MongoDB_guest_blog/Using_MongoDB_Connector_for_Spark.html

MongoDB + Spark Case Study

China Eastern Airlines – Fare Engine

130K seats, 180 million fares, and 1.6 billion daily searches

Spark and MongoDB

• An extremely powerful combination

• Many possible use cases

• Some operations are actually faster if performed using the Aggregation Framework

• Evolving all the time

Questions?

Muthu Chinnasamy
muthu@mongodb.com
@muthumongo
