blazing fast analytics with mongodb & spark

Post on 16-Apr-2017


Blazing Fast Analytics with MongoDB & Spark


Muthu Chinnasamy

Senior Solutions Architect
muthu@mongodb.com
Twitter: @MuthuMongo


Agenda

• The data challenge
• Spark
• Use Cases
• Connectors
• Demo

2010

Eric Schmidt

Every two days now we create as much information as we did from the dawn of civilization up until 2003

Apache Spark is the Taylor Swift of big data software.

Derrick Harris, Fortune


What is Spark?

Fast and general computing engine for clusters

• Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, graph processing
• It’s fundamentally different to what’s come before


Why not just use Hadoop?

• Spark is FAST
– Faster to write
– Faster to run
• Up to 100x faster than Hadoop MapReduce in memory
• 10x faster on disk

A visual comparison

Hadoop Spark


RDD Operations

Transformations: map, filter, flatMap, mapPartitions, sample, union, join, groupByKey, reduceByKey

Actions: reduce, collect, count, save, lookupKey, take, foreach
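The split between transformations and actions can be illustrated with a minimal sketch in plain Python. This is not the Spark API: generators stand in for lazy transformations, and the helper `reduce_by_key` and the sample `words` list are hypothetical names introduced only for illustration.

```python
from functools import reduce

# Plain-Python stand-in for an RDD (not the Spark API).
words = ["spark", "mongodb", "spark", "hadoop", "spark"]

# "Transformations": generator expressions are lazy, so nothing runs yet.
filtered = (w for w in words if w != "hadoop")   # filter analogue
pairs = ((w, 1) for w in filtered)               # map analogue


def reduce_by_key(func, kv_pairs):
    """Merge values that share a key (reduceByKey analogue).

    Evaluated eagerly here for simplicity; in Spark, reduceByKey is
    itself a lazy transformation.
    """
    merged = {}
    for key, value in kv_pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# "Actions": consuming the generators forces the whole chain to run.
counts = reduce_by_key(lambda a, b: a + b, pairs)
total = reduce(lambda a, b: a + b, counts.values())  # reduce analogue
```

The point of the sketch is only the evaluation order: the filter and map steps do no work until something downstream asks for results.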


Spark higher level libraries

Built on the Spark core engine:

• Spark SQL
• Spark Streaming
• MLlib
• GraphX

Spark + MongoDB


Data Management

• OLTP: applications, fine-grained operations, low latency
• Offline processing: analytics, data warehousing, high throughput


Spark + MongoDB top use cases:
– Business Intelligence
– Data Warehousing
– Recommendation
– Log processing
– User Facing Services
– Fraud detection


MongoDB and Spark


Spark reading directly from MongoDB


Aggregation pipeline to Pre-filter

Aggregation pipeline filter: $match
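A pre-filter of this kind is written as a list of pipeline stages. The sketch below shows a single `$match` stage in MongoDB's aggregation syntax, plus a toy evaluator in plain Python so the effect can be seen without a database. The documents, the field names (`user`, `movie`, `rating`) and the helpers `matches` and `apply_pipeline` are assumptions made for illustration; the evaluator handles only equality and `$gte`.

```python
# Hypothetical documents; field names are illustrative only.
ratings = [
    {"user": "Buzz", "movie": "Frozen", "rating": 4},
    {"user": "Woody", "movie": "Toy Story", "rating": 5},
    {"user": "Jessie", "movie": "Star Wars", "rating": 2},
]

# A one-stage aggregation pipeline: keep only ratings of 4 or more.
pipeline = [{"$match": {"rating": {"$gte": 4}}}]


def matches(doc, condition):
    """Toy check supporting only equality and the $gte operator."""
    for field, expected in condition.items():
        value = doc.get(field)
        if isinstance(expected, dict):
            for op, operand in expected.items():
                if op != "$gte":
                    raise NotImplementedError(op)
                if value is None or value < operand:
                    return False
        elif value != expected:
            return False
    return True


def apply_pipeline(docs, stages):
    """Apply each $match stage in order; other stages are not modelled."""
    for stage in stages:
        if "$match" not in stage:
            raise NotImplementedError(stage)
        docs = [d for d in docs if matches(d, stage["$match"])]
    return docs


prefiltered = apply_pipeline(ratings, pipeline)
```

When the filter runs inside MongoDB, only the matching documents need to be shipped to Spark, which is the reason for pushing `$match` down.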


Spark writing directly to MongoDB

Fraud Detection

I'm so in love!

Me, too<3

Now send me your CC number

?

Ok, XXXX-123-zzz

$$$

Fraud Detection

Sharing Workloads

Chat App

HDFS: Archiving, Data Crunching

Login, User Profile, Contacts, Messages, …

Fraud Detection, Segmentation, Recommendations

Spark

MongoDB + Spark Connector


MongoDB Spark Connector
https://spark-packages.org/?q=official+mongodb

MongoDB Spark

Connector

MongoDB Shard

Spark

MongoDB Spark Connector

https://github.com/mongodb/mongo-spark
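As a rough sketch of how the connector is pointed at a collection, the snippet below builds the two Spark properties used by the 2.x line of the connector, `spark.mongodb.input.uri` and `spark.mongodb.output.uri`. Later connector releases renamed these properties, so check the documentation for the version in use. The helper `connector_conf` and the host, database and collection names are hypothetical.

```python
def connector_conf(host, database, collection, port=27017):
    """Build Spark properties for reading from and writing to one collection.

    Property names follow the 2.x connector; the helper itself is
    illustrative and not part of any library.
    """
    uri = f"mongodb://{host}:{port}/{database}.{collection}"
    return {
        "spark.mongodb.input.uri": uri,    # where Spark reads from
        "spark.mongodb.output.uri": uri,   # where Spark writes to
    }


# Hypothetical deployment: a local mongod holding movies.ratings.
conf = connector_conf("localhost", "movies", "ratings")
```

These key/value pairs would typically be passed to Spark as configuration options when the session or job is created.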

Spark Streaming


Spark Streaming

Twitter Feed Spark


Spark Streaming

Twitter Feed

{
  "statuses": [
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Mon Sep 24 03:35:21 +0000 2012",
      "id_str": "250075927172759552",
      "entities": {
        "urls": [],
        "hashtags": [
          {"text": "freebandnames", "indices": [20, 34]}
        ],
        "user_mentions": []
      }
    }
  ]
}


Spark Streaming

Each incoming tweet is reduced to a per-minute hashtag count. The first tweet carrying the hashtag "freebandnames" produces:

{"time": "Mon Sep 24 03:35", "freebandnames": 1}

After three more tweets with the same hashtag arrive in the same minute, the count becomes:

{"time": "Mon Sep 24 03:35", "freebandnames": 4}
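The aggregation shown above (tweets in, per-minute hashtag counts out) can be sketched in plain Python. This is not Spark Streaming code; it only illustrates the per-batch counting logic, and the function name `count_hashtags` is an assumption. The tweet is the sample from the slide, with the `statuses` array closed so that it parses.

```python
import json
from collections import Counter
from datetime import datetime

TWEET_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

tweet = json.loads(
    '{"statuses": [{"coordinates": null, "favorited": false, '
    '"truncated": false, "created_at": "Mon Sep 24 03:35:21 +0000 2012", '
    '"id_str": "250075927172759552", "entities": {"urls": [], '
    '"hashtags": [{"text": "freebandnames", "indices": [20, 34]}], '
    '"user_mentions": []}}]}'
)


def count_hashtags(batch):
    """Count hashtags per minute across a batch of tweet payloads."""
    counts = {}
    for payload in batch:
        for status in payload["statuses"]:
            created = datetime.strptime(status["created_at"], TWEET_TIME_FORMAT)
            minute = created.strftime("%a %b %d %H:%M")  # drop the seconds
            bucket = counts.setdefault(minute, Counter())
            for tag in status["entities"]["hashtags"]:
                bucket[tag["text"]] += 1
    # One output document per minute, shaped like the slide's examples.
    return [{"time": minute, **tags} for minute, tags in counts.items()]


one = count_hashtags([tweet])
four = count_hashtags([tweet] * 4)
```

In a streaming job the same function body would run once per micro-batch, with the resulting documents written back to MongoDB.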

Spark


Capped Collection

MongoDB and Spark Streaming feature

{"time": "Mon Sep 24 03:35", "freebandnames": 4}

{"time": "Mon Nov 5 09:40", "mongoDBLondon": 400}

{"time": "Mon Nov 5 11:50", "spark": 7556}

{"time": "Mon Nov 24 12:50", "itshappening": 100}

Tailable Cursor
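The behaviour the slide relies on, a fixed-size collection whose oldest entries are overwritten, read by a cursor that keeps waiting for new inserts, can be modelled with a toy in-memory class. This is not the MongoDB driver API; `CappedCollection` and `read_from` are hypothetical names, and the documents are the ones shown above.

```python
from collections import deque


class CappedCollection:
    """Toy model of a fixed-size, insertion-ordered collection."""

    def __init__(self, max_docs):
        self._docs = deque(maxlen=max_docs)  # oldest docs fall off the front
        self._inserted = 0                   # total inserts ever made

    def insert(self, doc):
        self._docs.append(doc)
        self._inserted += 1

    def read_from(self, position):
        """Return surviving docs inserted at or after position, plus the
        position to resume from (tailable-cursor analogue)."""
        oldest = self._inserted - len(self._docs)
        start = max(position, oldest)
        return list(self._docs)[start - oldest:], self._inserted


coll = CappedCollection(max_docs=2)
coll.insert({"time": "Mon Sep 24 03:35", "freebandnames": 4})
coll.insert({"time": "Mon Nov 5 09:40", "mongoDBLondon": 400})
coll.insert({"time": "Mon Nov 5 11:50", "spark": 7556})

first_read, cursor = coll.read_from(0)      # the oldest doc is already gone
coll.insert({"time": "Mon Nov 24 12:50", "itshappening": 100})
second_read, cursor = coll.read_from(cursor)  # only the new doc is returned
```

The resume position plays the role of the tailable cursor: each read picks up where the last one stopped instead of rescanning the collection.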

MongoDB + Spark MLlib Demo


Collaborative Filtering

• Two parts
– Collaborative: using rating preferences from several users
– Filtering: recommend preferences

UserId / MovieId Star Wars Toy Story Frozen

Buzz 4 4 5

Woody 5 4

Jessie 5 ?

Movie Ratings as a matrix


MLlib ALS

• Approximate into User & Movie latent factor matrices

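[Figure: the ratings matrix (users Buzz, Woody, Jessie by movies Frozen, Toy Story, Star Wars) approximated as the product of a user factor matrix (two latent factors x, y per user) and a movie factor matrix (two latent factors x, y per movie). A rating rij is estimated from the user factors f(i) and the movie factors f(j).]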
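In the figure's notation, the factorisation can be written out as follows. This is the standard regularised least-squares objective usually associated with ALS, given as a sketch rather than as the exact formulation used in the demo; the set of observed ratings $\Omega$ and the regularisation weight $\lambda$ are symbols introduced here, with $i$ indexing users and $j$ indexing movies.

```latex
r_{ij} \approx f(i)^{\top} f(j),
\qquad
\min_{f} \sum_{(i,j) \in \Omega} \bigl( r_{ij} - f(i)^{\top} f(j) \bigr)^{2}
  + \lambda \Bigl( \sum_{i} \lVert f(i) \rVert^{2} + \sum_{j} \lVert f(j) \rVert^{2} \Bigr)
```

"Alternating" refers to fixing the movie factors while solving for the user factors, then swapping, since each half is an ordinary least-squares problem.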


Prediction Process

• Load movie ratings data from MongoDB
• Reflect and infer the input formats for the ALS algorithm
• Split the data
– 80% for training and 20% for validating the model
• Calculate the best model using the ALS algorithm
– Build/train a user-movie matrix model
• Combine the data with user preferences and retrain the model
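The split and the model comparison in the steps above can be sketched in plain Python. The demo itself runs on Spark; this version only shows an 80/20 split and a root-mean-square error, which is one common way to pick the best model (the slides do not say which metric the demo uses). The function names, the seed and the synthetic ratings are assumptions.

```python
import math
import random


def split_ratings(ratings, train_fraction=0.8, seed=42):
    """Shuffle reproducibly, then cut into training and validation sets."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic (user, movie, rating) triples, for illustration only.
ratings = [(user, movie, (user + movie) % 5 + 1)
           for user in range(10) for movie in range(10)]

training, validation = split_ratings(ratings)
```

Each candidate model would be trained on `training` and scored with `rmse` on `validation`; the model with the lowest error is kept.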


Explore as a Databricks Notebook
http://cdn2.hubspot.net/hubfs/438089/notebooks/MongoDB_guest_blog/Using_MongoDB_Connector_for_Spark.html

MongoDB + Spark Case Study


China Eastern Airlines – Fare Engine

130K seats, 180 million fares & 1.6 billion daily searches


Spark and MongoDB

• An extremely powerful combination

• Many possible use cases

• Some operations are actually faster if performed using Aggregation Framework

• Evolving all the time

Questions?

Muthu Chinnasamy
muthu@mongodb.com
@muthumongo
