MongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop?
Post on 15-Apr-2017
TRANSCRIPT
Welcome to MongoDB Evenings Dallas!

Agenda
5:30pm: Pizza, Beer & Soft Drinks
6:00pm: Welcome (William Kent, Enterprise Account Executive, MongoDB)
6:10pm: What's the Scoop on MongoDB & Hadoop? (Jake Angerman, Senior Solutions Architect, MongoDB)
7:00pm: Acxiom's Journey with MongoDB (John Riewerts, Director of Engineering, Marketing Services, Acxiom)
7:45pm: Announcements & Q&A
Documents Support Modern Requirements
Relational vs. Document Data Structure

{
  customer_id: 123,
  first_name: "Mark",
  last_name: "Smith",
  city: "San Francisco",
  location: [37.7576792, -122.5078112],
  image: <binary data>,
  phones: [
    { number: "1-212-555-1212", dnc: true, type: "home" },
    { number: "1-212-555-1213", type: "cell" }
  ]
}
MongoDB Technical Capabilities
[Diagram: Application → Driver → mongos router → Shard 1 … Shard N, each shard a replica set with one Primary and two Secondaries]
1. Dynamic document schema
   { name: "John Smith", date: "2013-08-01", address: "10 3rd St.",
     phone: { home: 1234567890, mobile: 1234568138 } }
   db.customer.insert({…})
   db.customer.find({ name: "John Smith" })
2. Native language drivers
3. High availability – replica sets
4. High performance – data locality, indexes, RAM
5. Horizontal scalability – sharding
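The dynamic-schema and query behavior above can be sketched in plain Python. This is a toy stand-in, not a MongoDB driver: a list of dicts plays the collection, and find() does equality matching like the shell example.

```python
# Toy illustration of the document model: a dict-per-record "collection"
# with insert/find semantics loosely mirroring db.customer.insert()/find().
# Plain Python only -- no MongoDB server or driver involved.

class ToyCollection:
    def __init__(self):
        self.docs = []

    def insert(self, doc):
        self.docs.append(doc)

    def find(self, query):
        # Return documents whose fields equal every key/value in the query
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

customer = ToyCollection()
customer.insert({"name": "John Smith", "date": "2013-08-01",
                 "address": "10 3rd St.",
                 "phone": {"home": 1234567890, "mobile": 1234568138}})
# A second document with a different shape -- no schema migration needed
customer.insert({"name": "Jane Doe", "email": "jane@example.com"})

print(customer.find({"name": "John Smith"})[0]["address"])  # 10 3rd St.
```

The second insert shows the point of a dynamic schema: documents in one collection need not share fields.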
Hadoop
A framework for distributed processing of large data sets
• Terabyte and petabyte datasets
• Data warehousing
• Advanced analytics
• Not a database
• No indexes
• Batch processing
Data Management
Hadoop: fault tolerance, batch processing, coarse-grained operations, unstructured data, WORM
MongoDB: high availability, mutable data, fine-grained operations, flexible schemas, CRUD
Data Management
Hadoop: offline processing, analytics, data warehousing
MongoDB: online operations, application-facing, operational
Commerce Use Case
Applications powered by MongoDB:
• Products & inventory
• Recommended products
• Customer profile
• Session management

Analysis powered by Hadoop:
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history

Linked by the MongoDB Connector for Hadoop.
Insurance Use Case
Applications powered by MongoDB:
• Customer profiles
• Insurance policies
• Session data
• Call center data

Analysis powered by Hadoop:
• Customer action analysis
• Churn analysis
• Churn prediction
• Policy rates

Linked by the MongoDB Connector for Hadoop.
Fraud Detection Use Case
[Diagram: online payments processing writes payments to MongoDB; a nightly analysis job moves data through the MongoDB Connector for Hadoop into fraud modeling, alongside 3rd-party data sources; results are written back to a results cache in MongoDB, which the fraud detection service reads (query only)]
MongoDB Connector for Hadoop
Applications (MongoDB):
• Low latency
• Rich, fast querying
• Flexible indexing
• Aggregations in database
• Known data relationships
• Great for any subset of data

Distributed Analytics (Hadoop):
• Longer jobs
• Batch analytics
• Highly parallel processing
• Unknown data relationships
• Great for looking at all data or large subsets

Bridged by the MongoDB Connector for Hadoop.
db.tweets.aggregate([
  { $group: {
      _id: { hour: { $hour: "$date" }, minute: { $minute: "$date" } },
      total: { $sum: "$sentiment.score" },
      average: { $avg: "$sentiment.score" },
      count: { $sum: 1 },
      happyTalk: { $push: "$sentiment.positive" }
  } },
  { $unwind: "$happyTalk" },
  { $unwind: "$happyTalk" },
  { $group: {
      _id: "$_id",
      total: { $first: "$total" },
      average: { $first: "$average" },
      count: { $first: "$count" },
      happyTalk: { $addToSet: "$happyTalk" }
  } },
  { $sort: { _id: -1 } }
])
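What the pipeline's first $group stage computes can be sketched in plain Python. The tweets and sentiment scores below are invented sample data; the grouping key and the total/average/count fields mirror the pipeline.

```python
# Pure-Python sketch of the $group stage: per (hour, minute) bucket,
# compute the total, average, and count of sentiment scores.
from collections import defaultdict
from datetime import datetime

tweets = [
    {"date": datetime(2017, 4, 15, 9, 30), "sentiment": {"score": 2}},
    {"date": datetime(2017, 4, 15, 9, 30), "sentiment": {"score": 4}},
    {"date": datetime(2017, 4, 15, 9, 31), "sentiment": {"score": 1}},
]

buckets = defaultdict(list)
for t in tweets:
    key = (t["date"].hour, t["date"].minute)
    buckets[key].append(t["sentiment"]["score"])

stats = {k: {"total": sum(v), "average": sum(v) / len(v), "count": len(v)}
         for k, v in buckets.items()}
print(stats[(9, 30)])  # {'total': 6, 'average': 3.0, 'count': 2}
```

The real pipeline runs this server-side across the whole collection; the sketch only shows the shape of the computation.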
But doesn't MongoDB have…?
• aggregation framework
  – but no machine learning libraries
• map-reduce
  – runs JavaScript
  – competes with operational workloads
MongoDB Data Operations Spectrum (faster → slower)
• Document retrieval – ~1ms if in cache, ~10ms from spinning disk
• .find() – per-document cost similar to single-document retrieval
  – _id range
  – any secondary index range; can be a composite key
  – intersect two indexes
  – covered indexes are even faster
• .count(), .distinct(), .group() – fast, may be covered
• .aggregate() – retrieval cost like find, plus pipeline operations
  – $match, $group
  – $project, $redact
• .mapReduce() – in-database JavaScript
• Hadoop Connector
  – mongo.input.query for indexed partial scan
  – full scan
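The gap between the fast and slow ends of this spectrum comes down to indexed access versus a full scan. A plain-Python illustration (a sorted list standing in for a B-tree index, not MongoDB internals): a range lookup jumps straight to the boundaries, while a collection scan touches every document.

```python
# Indexed range lookup vs. full scan, sketched with bisect on a sorted
# key list standing in for a B-tree index. The indexed path is
# O(log n) plus results; the scan is O(n).
import bisect

ids = list(range(1_000_000))           # sorted "_id index"

def indexed_range(lo, hi):
    # Binary-search to the range boundaries, then slice
    left = bisect.bisect_left(ids, lo)
    right = bisect.bisect_right(ids, hi)
    return ids[left:right]

def full_scan(lo, hi):
    # Touch every entry, keeping only matches
    return [i for i in ids if lo <= i <= hi]

assert indexed_range(10, 15) == full_scan(10, 15) == [10, 11, 12, 13, 14, 15]
```

Both return the same result; only the amount of data examined differs, which is why the Hadoop Connector's mongo.input.query (an indexed partial scan) beats a full scan.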
Analytics Landscape
[Chart: analytics workloads by latency, across modern and legacy stacks – Real-Time Dashboards / Scoring (<30 ms), Planned Reporting (secs – mins), Batch / Predictive / Ad Hoc (mins – hours)]
MetLife – Single View
[Diagram: card-system silos (Silo 1 … Silo N) feed, via pub-sub/ETL, an operational data layer in MongoDB holding insurance policies, demographic data, customer web data, and call center data; it powers a single CSR application, a unified customer portal, and operational reporting. The MongoDB Connector for Hadoop links the operational data layer to a DW/data lake running churn-prediction algorithms, customer clustering, churn analysis, and predictive analytics]
Foursquare
• k-nearest-neighbor problems – similarity of venues, people, or brands
• MongoDB data has advantages over log files when used with MapReduce:
  – log files can be stale
  – log files may not contain as much information
  – you can scan much less data
• Data reaches Hadoop as a BSON dump via the MongoDB Connector for Hadoop
Connector Overview
Data: read/write MongoDB, read/write BSON
Tools: MapReduce, Pig, Hive, Spark
Platforms: Apache Hadoop, Cloudera CDH, Hortonworks HDP, Amazon EMR
MapReduce Configuration
• MongoDB input
  – mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
  – mongo.input.uri=mongodb://mydb:27017/db1.collection1
• MongoDB output
  – mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
  – mongo.output.uri=mongodb://mydb:27017/db1.collection2
• BSON input/output
  – mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
  – mapred.input.dir=hdfs:///tmp/database.bson
  – mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
  – mapred.output.dir=hdfs:///tmp/output.bson
Pig
• High-level platform for creating MapReduce jobs
• Pig Latin abstracts Java into easier-to-use notation
• Executed as a series of MapReduce applications
• Supports user-defined functions (UDFs)
samples = LOAD 'mongodb://127.0.0.1:27017/sensor.logs'
    USING com.mongodb.hadoop.pig.MongoLoader('deviceId:int,value:double');
grouped = GROUP samples BY deviceId;
sample_stats = FOREACH grouped {
    mean = AVG(samples.value);
    GENERATE group AS deviceId, mean AS mean;
}
STORE sample_stats INTO 'mongodb://127.0.0.1:27017/sensor.stats'
    USING com.mongodb.hadoop.pig.MongoStorage;
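What the Pig script computes per device can be sketched in plain Python on invented sample data: group samples by deviceId, then take the mean value of each group.

```python
# Plain-Python equivalent of the Pig GROUP/FOREACH/AVG pipeline:
# mean sensor value per deviceId. Sample data is made up.
from collections import defaultdict

samples = [
    {"deviceId": 1, "value": 10.0},
    {"deviceId": 1, "value": 20.0},
    {"deviceId": 2, "value": 5.0},
]

grouped = defaultdict(list)
for s in samples:
    grouped[s["deviceId"]].append(s["value"])

sample_stats = {dev: sum(vals) / len(vals) for dev, vals in grouped.items()}
print(sample_stats)  # {1: 15.0, 2: 5.0}
```

Pig runs the same shape of computation as distributed MapReduce jobs over the full collection.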
Hive
• Data warehouse infrastructure built on top of Hadoop
• Provides data summarization, query, and analysis
• HiveQL is a subset of SQL
• Support for user-defined functions (UDFs)
Hive Support
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES ("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES ("mongo.uri" = "mongodb://host:27017/test.users")
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler
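The mongo.columns.mapping property above expresses a positional field-to-column mapping. A hypothetical Python helper (not part of the connector) can illustrate the idea of translating a BSON document into a Hive row:

```python
# Illustration only: map MongoDB document fields to Hive columns the
# way "mongo.columns.mapping" = "_id,name,age" pairs _id/name/age with
# the id/name/age columns of the mongo_users table.
mapping = {"id": "_id", "name": "name", "age": "age"}

def to_hive_row(doc, mapping):
    # Pull each mapped MongoDB field into its Hive column slot
    return {col: doc.get(field) for col, field in mapping.items()}

doc = {"_id": 1, "name": "Mark", "age": 30}
print(to_hive_row(doc, mapping))  # {'id': 1, 'name': 'Mark', 'age': 30}
```

The real MongoStorageHandler does this translation inside Hive; the sketch just shows the mapping's shape.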
Spark
An engine for processing Hadoop data that can perform MapReduce in addition to streaming, interactive queries, and machine learning.
• Powerful built-in transformations and actions
  – map, reduceByKey, union, distinct, sample, intersection, and more
  – foreach, count, collect, take, and many more
Create the Resilient Distributed Dataset (RDD)
// Generic form
rdd = sc.newAPIHadoopRDD(
    config, MongoInputFormat.class, Object.class, BSONObject.class)

// Configure input/output
config.set("mongo.input.uri",
    "mongodb://127.0.0.1:27017/marketdata.minbars")
config.set("mongo.input.query",
    '{"_id": {"$gt": {"$date": 1182470400000}}}')
config.set("mongo.output.uri",
    "mongodb://127.0.0.1:27017/marketdata.fiveminutebars")

// Scala
val minBarRawRDD = sc.newAPIHadoopRDD(
    config,
    classOf[com.mongodb.hadoop.MongoInputFormat],
    classOf[Object],
    classOf[BSONObject])
K-means Clustering

>>> mongo_rdd = sc.mongoRDD('mongodb://localhost:27017/adsb.tincan')
>>> parsed_rdd = mongo_rdd.map(parseData)
>>> clusters = KMeans.train(parsed_rdd, 10, maxIterations=10, runs=1,
...                         initializationMode="random")
>>> cluster_sizes = parsed_rdd.map(lambda e: clusters.predict(e)).countByValue()
>>> cluster_sizes
defaultdict(<type 'int'>, {0: 70122, 1: 350890, 2: 118596, 3: 104609, 4: 254759, 5: 175840, 6: 166789, 7: 68309, 8: 147826, 9: 495102})
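The assign-then-recompute loop that KMeans.train runs at scale can be sketched in plain Python for one-dimensional points and k=2. The points and starting centroids below are invented for illustration.

```python
# Minimal k-means sketch: assign each point to its nearest centroid,
# then recompute each centroid as the mean of its cluster. MLlib's
# KMeans.train performs this same iteration over a distributed RDD.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = {c: [] for c in range(len(centroids))}
        for p in points:
            # Assignment step: nearest centroid by absolute distance
            nearest = min(range(len(centroids)),
                          key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Update step: centroid moves to the mean of its cluster
        centroids = [sum(pts) / len(pts) if pts else centroids[c]
                     for c, pts in clusters.items()]
    return centroids

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(sorted(kmeans(points, [0.0, 5.0])))  # [1.0, 9.0]
```

With two well-separated groups the centroids converge after the first iteration; real workloads like the flight-data clustering above do the same over millions of multidimensional points.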
More Complete EDM Architecture & Data Lake
• Optimal location for providing operational response times & slices
• Governance to choose where to load and process data
[Diagram: siloed source databases, external feeds (batch), and streams enter a data processing pipeline (pub-sub, ETL, file imports) with stream processing. MongoDB serves operational applications & reporting (single CSR application, unified digital apps, operational reporting) through drivers & stacks, while a data lake handles distributed processing (customer clustering, churn analysis, predictive analytics) and analytic reporting for downstream systems. Processing can run on all data or slices. Stream icon from: https://en.wikipedia.org/wiki/File:Activity_Streams_icon.png]