Data Processing and Aggregation with MongoDB

Data Processing and Aggregation Senior Solutions Architect, MongoDB Inc [email protected]. Massimo Brignoli @massimobrignoli

Upload: mongodb

Post on 10-May-2015

TRANSCRIPT

Page 1: Data Processing and Aggregation with MongoDB

Data Processing and Aggregation

Senior Solutions Architect, MongoDB Inc

[email protected].

Massimo Brignoli

@massimobrignoli

Page 2: Data Processing and Aggregation with MongoDB

Who am I?

•  Solutions Architect/Evangelist at MongoDB Inc.

•  24 years of experience in databases and software development

•  Former employee of MySQL and MariaDB

•  Previously: web, web, web

Page 3: Data Processing and Aggregation with MongoDB

Big Data

Page 4: Data Processing and Aggregation with MongoDB

Innovation

Page 5: Data Processing and Aggregation with MongoDB

Understanding Big Data – It’s Not Very “Big”

from Big Data Executive Summary – 50+ top executives from Government and F500 firms

64% - Ingest diverse, new data in real-time

15% - More than 100TB of data

20% - Less than 100TB (average of all? <20TB)

Page 6: Data Processing and Aggregation with MongoDB

“I have not failed. I've just found 10,000 ways that won't work.” ― Thomas A. Edison

Page 7: Data Processing and Aggregation with MongoDB

Many great innovations since 1970…

Page 8: Data Processing and Aggregation with MongoDB

But would you use one of these technologies to launch a new business today?

Page 9: Data Processing and Aggregation with MongoDB

Including the relational data model!

Page 10: Data Processing and Aggregation with MongoDB

What computers was the relational model designed for?

Page 11: Data Processing and Aggregation with MongoDB

These were the computers!

Page 12: Data Processing and Aggregation with MongoDB

And storage?

Page 13: Data Processing and Aggregation with MongoDB

And how was software developed?

… LISP (LISt Processing language) [24]. At that time, the significant problems concerned not so much the organization of the production process, usually left in the hands of a few experts, as the understanding and the memory of what was being developed within the developer community itself. Hence the need to transcribe computation into a formalism familiar to the programmers of the day; it was the era of abstractions: mathematical formulas (FORTRAN), logical abstractions (LISP), or economic transactions (COBOL)1. Moreover, the spread and diversification of the electronic computer market made it pressing to write programs that would run on multiple computing platforms.

It soon became clear that this was not enough (late 1960s and early 1970s). Programs were growing larger and required the collaboration of several people, even several groups of people. The development process had to be structured to make it more organic. Borrowing concepts and terminology from the management theories in vogue at the time (and, to be precise, in the preceding decades), first and foremost a very static version of Fordism [3], attention focused on the idea of organizing the development process into independent modules with clear, composable interfaces. Concepts such as structured programming, data hiding, and encapsulation spread. ALGOL [10], C [21], Pascal [20], Modula 2 [35], and Ada [15] were born; even FORTRAN was evolved in this direction.

The modular production process in its software translation, the "waterfall" development model, dominated the 1970s and the early 1980s. The limits of this model, above all its inability to handle the flexibility that software production requires, began to show around the mid-1980s. There were small course corrections, such as the "V" development models, but essentially the modular model and the associated structured languages remained predominant. The guiding idea was that the lack of precision, and the uncertainty that caused it, could and should be resolved a priori, through more accurate and detailed specifications.

In the 1980s the scientific community became aware that the problems were tied not only to human shortcomings in defining the system, but also to a gray area intrinsic to the development of any software system, which made it impossible to define completely and correctly from the outset the characteristics that the software system would have in the end.

The starting point for the development of a software system is, in fact, needs that are real or perceived as such.

But many of these needs are expressed not in …

Source: Mondo Digitale, n. 4, December 2003

Figure 1: Evolution of languages (timeline 1950 to 2000; columns: Process, Need, Language)

Need: first attempts at "order" in development; comprehensibility and portability of code, to sustain its evolution; "industrial" organization of software system development; impossibility of precisely defining the system to be developed; very rapid development and distribution, oriented toward communication systems

Process: Waterfall, "V", ...; Incremental, Spiral, ...; Agile methodologies

Language: assembly languages; high-level languages; structured languages; object-oriented languages; languages for dynamic development

1 Other languages are not mentioned here: although very popular at the time, they would mean very little to today's reader.

Page 14: Data Processing and Aggregation with MongoDB

RDBMS Makes Development Hard

Relational Database

Object Relational Mapping Application

Code XML Config DB Schema

Page 15: Data Processing and Aggregation with MongoDB

And Even Harder to Evolve…

New Table

New Table

New Column

Name Pet Phone Email

New Column

3 months later…

Page 16: Data Processing and Aggregation with MongoDB

RDBMS

From Complexity to Simplicity…

MongoDB

{
  _id : ObjectId("4c4ba5e5e8aabf3"),
  employee_name : "Dunham, Justin",
  department : "Marketing",
  title : "Product Manager, Web",
  report_up : "Neray, Graham",
  pay_band : "C",
  benefits : [
    { type : "Health", plan : "PPO Plus" },
    { type : "Dental", plan : "Standard" }
  ]
}

Page 17: Data Processing and Aggregation with MongoDB

What Is a Record?

Page 18: Data Processing and Aggregation with MongoDB

Key → Value

•  One-dimensional storage

•  Each value is a blob

•  Queries are by key only

•  No schema

•  A value cannot be updated in place, only overwritten

Key Blob

Page 19: Data Processing and Aggregation with MongoDB

Relational

•  Two-dimensional storage (tuples)

•  Each field contains a single value

•  Queries on any field

•  Highly structured schema (tables)

•  In-place updates

•  Normalization requires many tables and indexes, and leads to poor data locality

Primary Key

Page 20: Data Processing and Aggregation with MongoDB

Document

•  N-dimensional storage

•  Each field can contain zero, one, or many values, or embedded values

•  Queries on all fields and at all levels

•  Dynamic schema

•  Inline updates

•  Embedding data improves data locality, requires fewer indexes, and performs better

_id
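A minimal Python sketch of what "queries on all fields and at all levels" means for an embedded document. The helper `get_path` is hypothetical (it is not a MongoDB or PyMongo API); it only imitates dotted-path lookup on plain dictionaries, using a trimmed copy of the employee record from the earlier slide.

```python
# Hypothetical helper: imitates dotted-path lookup on a nested document.
def get_path(doc, path):
    """Return every value reachable via a dotted path, descending into lists."""
    values = [doc]
    for key in path.split("."):
        next_values = []
        for value in values:
            # An embedded array contributes each of its elements.
            items = value if isinstance(value, list) else [value]
            for item in items:
                if isinstance(item, dict) and key in item:
                    next_values.append(item[key])
        values = next_values
    return values


employee = {
    "employee_name": "Dunham, Justin",
    "department": "Marketing",
    "benefits": [
        {"type": "Health", "plan": "PPO Plus"},
        {"type": "Dental", "plan": "Standard"},
    ],
}

print(get_path(employee, "department"))     # top-level field
print(get_path(employee, "benefits.type"))  # field inside an embedded array
```

In a key-value store the whole record would be an opaque blob, so neither lookup would be possible without fetching and decoding the value first.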

Page 21: Data Processing and Aggregation with MongoDB

For over a decade

Big Data == Custom Software

Page 22: Data Processing and Aggregation with MongoDB

In the past few years Open source software has emerged enabling the rest of us to handle Big Data

Page 23: Data Processing and Aggregation with MongoDB

How MongoDB Meets Our Requirements

•  MongoDB is an operational database

•  MongoDB provides high performance for storage and retrieval at large scale

•  MongoDB has a robust query interface permitting intelligent operations

•  MongoDB is not a data processing engine, but provides processing functionality

Page 24: Data Processing and Aggregation with MongoDB

http://www.flickr.com/photos/torek/4444673930/

MongoDB data processing options

Page 25: Data Processing and Aggregation with MongoDB

Getting Example Data

Page 26: Data Processing and Aggregation with MongoDB

The “hello world” of MapReduce is counting words in a paragraph of text.

Let’s try something a little more interesting…

Page 27: Data Processing and Aggregation with MongoDB

What is the most popular pub name?

Page 28: Data Processing and Aggregation with MongoDB

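#!/usr/bin/env python
# Data Source
# http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59]
import re
import sys

from imposm.parser import OSMParser
import pymongo

class Handler(object):
    def nodes(self, nodes):
        if not nodes:
            return
        docs = []
        for node in nodes:
            osm_id, doc, (lon, lat) = node
            if "name" not in doc:
                # Unnamed node: remember its coordinates and skip it.
                # (node_points and collection are not defined in this excerpt.)
                node_points[osm_id] = (lon, lat)
                continue
            # Note: lstrip("The ") strips leading 'T', 'h', 'e' and space
            # characters, not the literal prefix "The ".
            doc["name"] = doc["name"].title().lstrip("The ").replace("And", "&")
            doc["_id"] = osm_id
            # Store the position as a GeoJSON point: [longitude, latitude].
            doc["location"] = {"type": "Point", "coordinates": [lon, lat]}
            docs.append(doc)
        collection.insert(docs)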

Open Street Map Data

Page 29: Data Processing and Aggregation with MongoDB

{
  "_id" : 451152,
  "amenity" : "pub",
  "name" : "The Dignity",
  "addr:housenumber" : "363",
  "addr:street" : "Regents Park Road",
  "addr:city" : "London",
  "addr:postcode" : "N3 1DH",
  "toilets" : "yes",
  "toilets:access" : "customers",
  "location" : { "type" : "Point", "coordinates" : [-0.1945732, 51.6008172] }
}

Example Pub Data

Page 30: Data Processing and Aggregation with MongoDB

MongoDB MapReduce

Stages inside MongoDB: map → reduce → finalize

Page 31: Data Processing and Aggregation with MongoDB

Map Function

> var map = function() {

emit(this.name, 1);

}


Page 32: Data Processing and Aggregation with MongoDB

Reduce Function

> var reduce = function (key, values) {

var sum = 0;

values.forEach( function (val) {sum += val;} );

return sum;

}


Page 33: Data Processing and Aggregation with MongoDB

Results

> db.pubs.mapReduce(map, reduce, { out: "pub_names", query: { } } )

> db.pub_names.find().sort({value: -1}).limit(10)

{ "_id" : "The Red Lion", "value" : 407 }
{ "_id" : "The Royal Oak", "value" : 328 }
{ "_id" : "The Crown", "value" : 242 }
{ "_id" : "The White Hart", "value" : 214 }
{ "_id" : "The White Horse", "value" : 200 }
{ "_id" : "The New Inn", "value" : 187 }
{ "_id" : "The Plough", "value" : 185 }
{ "_id" : "The Rose & Crown", "value" : 164 }
{ "_id" : "The Wheatsheaf", "value" : 147 }
{ "_id" : "The Swan", "value" : 140 }
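The map and reduce steps above can be imitated in a few lines of plain Python. This is only a sketch under stated assumptions: `map_pub`, `reduce_counts`, and `map_reduce` are hypothetical names, the sample names are made up rather than taken from the OpenStreetMap extract, and the reducer is called exactly once per key, whereas MongoDB may call it several times on partial results.

```python
from collections import defaultdict


def map_pub(doc):
    """Mirror of the slide's map function: emit (name, 1) for each pub."""
    yield doc["name"], 1


def reduce_counts(key, values):
    """Mirror of the slide's reduce function: sum the emitted values."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    # Group emitted values by key, then reduce each group to a single value.
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


# Made-up sample data, not the data set used in the talk.
pubs = [
    {"name": "The Red Lion"},
    {"name": "The Crown"},
    {"name": "The Red Lion"},
    {"name": "The Royal Oak"},
    {"name": "The Red Lion"},
    {"name": "The Crown"},
]

counts = map_reduce(pubs, map_pub, reduce_counts)
# Equivalent of .sort({value: -1}): most frequent names first.
top = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(top)
```

Because the reducer only sums, it returns the same kind of value it receives, which is what allows a real engine to re-reduce partial results safely.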

Page 34: Data Processing and Aggregation with MongoDB
Page 35: Data Processing and Aggregation with MongoDB

> db.pubs.mapReduce(map, reduce, {
    out: "pub_names",
    query: { location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] } } }
  })

{
  "result" : "pub_names",
  "timeMillis" : 116,
  "counts" : { "input" : 643, "emit" : 643, "reduce" : 54, "output" : 537 },
  "ok" : 1
}

Pub Names in the Center of London
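The radius passed to $centerSphere in the query above is an angle in radians: 2 miles divided by an Earth radius of 3959 miles. The Python sketch below shows that arithmetic with the haversine formula. It is an illustration only, not how MongoDB evaluates the query internally; `central_angle` and `within_center_sphere` are hypothetical helper names, and the `near` point is a made-up test coordinate.

```python
import math

EARTH_RADIUS_MILES = 3959  # same constant as in the query above


def central_angle(p, q):
    """Angle in radians between two [lon, lat] points (haversine formula)."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


def within_center_sphere(point, center, radius_radians):
    """True if point lies inside the spherical cap around center."""
    return central_angle(point, center) <= radius_radians


center = [-0.12, 51.516]
radius = 2 / EARTH_RADIUS_MILES  # 2 miles expressed in radians

near = [-0.1275, 51.5072]        # made-up point, under a mile from the center
far = [-0.1945732, 51.6008172]   # the example pub document shown earlier

print(within_center_sphere(near, center, radius))
print(within_center_sphere(far, center, radius))
```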

Page 36: Data Processing and Aggregation with MongoDB

> db.pub_names.find().sort({value: -1}).limit(10)

{ "_id" : "All Bar One", "value" : 11 }
{ "_id" : "The Slug & Lettuce", "value" : 7 }
{ "_id" : "The Coach & Horses", "value" : 6 }
{ "_id" : "The Green Man", "value" : 5 }
{ "_id" : "The Kings Arms", "value" : 5 }
{ "_id" : "The Red Lion", "value" : 5 }
{ "_id" : "Corney & Barrow", "value" : 4 }
{ "_id" : "O'Neills", "value" : 4 }
{ "_id" : "Pitcher & Piano", "value" : 4 }
{ "_id" : "The Crown", "value" : 4 }

Results

Page 37: Data Processing and Aggregation with MongoDB

MongoDB MapReduce

•  Real-time

•  Output directly to document or collection

•  Runs inside MongoDB on local data

− Adds load to your DB

− In JavaScript – debugging can be a challenge

− Translating in and out of C++

Page 38: Data Processing and Aggregation with MongoDB

Aggregation Framework

Page 39: Data Processing and Aggregation with MongoDB

Aggregation Framework

Pipeline inside MongoDB: op1 → op2 → … → opN

Page 40: Data Processing and Aggregation with MongoDB

Aggregation Framework in 60 Seconds

Page 41: Data Processing and Aggregation with MongoDB

Aggregation Framework Operators

•  $project

•  $match

•  $limit

•  $skip

•  $sort

•  $unwind

•  $group

Page 42: Data Processing and Aggregation with MongoDB

$match

•  Filter documents

•  Uses existing query syntax

•  If using $geoNear, it has to be first in the pipeline

•  $where is not supported

Page 43: Data Processing and Aggregation with MongoDB

Matching Field Values

Input documents:

{ "_id" : 271421, "amenity" : "pub", "name" : "Sir Walter Tyrrell", "location" : { "type" : "Point", "coordinates" : [ -1.6192422, 50.9131996 ] } }

{ "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }

Stage:

{ "$match": { "name": "The Red Lion" } }

Output:

{ "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
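As a rough illustration of the filtering step, the sketch below imitates an equality-only $match over plain Python dictionaries. The function `match` is hypothetical and covers only exact matches on top-level fields; the real stage accepts the full query syntax. The sample documents are the two pubs from the slide, with the location field left out for brevity.

```python
def match(docs, criteria):
    """Keep documents whose top-level fields equal every value in criteria."""
    return [
        doc for doc in docs
        if all(field in doc and doc[field] == value
               for field, value in criteria.items())
    ]


pubs = [
    {"_id": 271421, "amenity": "pub", "name": "Sir Walter Tyrrell"},
    {"_id": 271466, "amenity": "pub", "name": "The Red Lion"},
]

matched = match(pubs, {"name": "The Red Lion"})
print(matched)
```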

Page 44: Data Processing and Aggregation with MongoDB

$project

•  Reshape documents

•  Include, exclude or rename fields

•  Inject computed fields

•  Create sub-document fields

Page 45: Data Processing and Aggregation with MongoDB

Including and Excluding Fields

{

"_id" : 271466,

"amenity" : "pub",

"name" : "The Red Lion",

"location" : {

"type" : "Point",

"coordinates" : [

-1.5494749,

50.7837119

]

}

}

{ "$project": { "_id": 0, "amenity": 1, "name": 1 } }

{ "amenity" : "pub", "name" : "The Red Lion" }

Page 46: Data Processing and Aggregation with MongoDB

Reformatting Documents

{

"_id" : 271466,

"amenity" : "pub",

"name" : "The Red Lion",

"location" : {

"type" : "Point",

"coordinates" : [

-1.5494749,

50.7837119

]

}

}

{ "$project": { "_id": 0, "name": 1, "meta": { "type": "$amenity" } } }

{ "name" : "The Red Lion", "meta" : { "type" : "pub" } }
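The two $project examples can be imitated with a small Python function. This is a sketch of a narrow subset only: `project` is a hypothetical helper that handles 1 (include), `_id: 0` (exclude), a "$field" reference (copy a value), and a nested dictionary (build a sub-document). Computed expressions and other features of the real stage are not covered.

```python
def project(doc, spec):
    """Apply a small subset of $project to one document."""
    out = {}
    # _id is included by default unless the spec sets it to 0.
    if spec.get("_id", 1) and "_id" in doc:
        out["_id"] = doc["_id"]
    for field, rule in spec.items():
        if field == "_id":
            continue
        if isinstance(rule, dict):
            # Nested spec: build a sub-document from the same source document.
            out[field] = project(doc, dict(rule, _id=0))
        elif isinstance(rule, str) and rule.startswith("$"):
            # "$name" copies the value of the field called name, if present.
            source = rule[1:]
            if source in doc:
                out[field] = doc[source]
        elif rule == 1 and field in doc:
            out[field] = doc[field]
    return out


pub = {
    "_id": 271466,
    "amenity": "pub",
    "name": "The Red Lion",
    "location": {"type": "Point", "coordinates": [-1.5494749, 50.7837119]},
}

print(project(pub, {"_id": 0, "amenity": 1, "name": 1}))
print(project(pub, {"_id": 0, "name": 1, "meta": {"type": "$amenity"}}))
```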

Page 47: Data Processing and Aggregation with MongoDB

$group

•  Group documents by an ID

•  Field reference, object, constant

•  Other output fields are computed with accumulators: $max, $min, $avg, $sum, $addToSet, $push, $first, $last

•  Processes all data in memory

Page 48: Data Processing and Aggregation with MongoDB

Back to the pub!

•  http://www.offwestend.com/index.php/theatres/pastshows/71

Page 49: Data Processing and Aggregation with MongoDB

Popular Pub Names

> var popular_pub_names = [

{ $match : { location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] } } } },

{ $group : { _id: "$name", value: {$sum: 1} } },

{ $sort : {value: -1} },

{ $limit : 10 }

]

Page 50: Data Processing and Aggregation with MongoDB

> db.pubs.aggregate(popular_pub_names)

{
  "result" : [
    { "_id" : "All Bar One", "value" : 11 },
    { "_id" : "The Slug & Lettuce", "value" : 7 },
    { "_id" : "The Coach & Horses", "value" : 6 },
    { "_id" : "The Green Man", "value" : 5 },
    { "_id" : "The Kings Arms", "value" : 5 },
    { "_id" : "The Red Lion", "value" : 5 },
    { "_id" : "Corney & Barrow", "value" : 4 },
    { "_id" : "O'Neills", "value" : 4 },
    { "_id" : "Pitcher & Piano", "value" : 4 },
    { "_id" : "The Crown", "value" : 4 }
  ],
  "ok" : 1
}

Results
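For readers who want to see what the $group, $sort, and $limit stages compute, here is a plain-Python sketch. The function `popular_names` is hypothetical, the input names are made up, and the geospatial $match is left out. Ties are broken alphabetically here as an assumption of the sketch; the pipeline itself sorts only on value.

```python
from collections import Counter


def popular_names(docs, limit=10):
    # $group with {$sum: 1}: count the documents that share each name.
    counts = Counter(doc["name"] for doc in docs)
    # $sort on value descending (alphabetical tie-break is an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # $limit: keep only the first few groups.
    return [{"_id": name, "value": value} for name, value in ranked[:limit]]


# Made-up sample data, not the data set used in the talk.
pubs = [
    {"name": "All Bar One"},
    {"name": "The Crown"},
    {"name": "All Bar One"},
    {"name": "The Green Man"},
    {"name": "All Bar One"},
    {"name": "The Crown"},
]

for row in popular_names(pubs, limit=2):
    print(row)
```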

Page 51: Data Processing and Aggregation with MongoDB

Aggregation Framework Benefits

•  Real-time

•  Simple yet powerful interface

•  Declared in JSON, executes in C++

•  Runs inside MongoDB on local data

− Adds load to your DB

− Limited Operators

− Data output is limited

Page 52: Data Processing and Aggregation with MongoDB

Analyzing MongoDB Data in External Systems

Page 53: Data Processing and Aggregation with MongoDB

MongoDB with Hadoop

MongoDB

Page 54: Data Processing and Aggregation with MongoDB

MongoDB with Hadoop

MongoDB warehouse

Page 55: Data Processing and Aggregation with MongoDB

MongoDB with Hadoop

MongoDB ETL

Page 56: Data Processing and Aggregation with MongoDB

#!/usr/bin/env python
import sys

from pymongo_hadoop import BSONMapper

def mapper(documents):
    bounds = get_bounds()  # ~2 mile polygon
    for doc in documents:
        geo = get_geo(doc["location"])  # Convert the geo type
        if not geo:
            continue
        # Emit one count per pub that falls inside the polygon.
        if bounds.intersects(geo):
            yield {'_id': doc['name'], 'count': 1}

# get_bounds() and get_geo() are helpers not shown on this slide.
BSONMapper(mapper)
sys.stderr.write("Done Mapping.\n")

Map Pub Names in Python

Page 57: Data Processing and Aggregation with MongoDB

#!/usr/bin/env python
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    # Sum the per-pub counts emitted by the mapper for this name.
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'value': _count}

BSONReducer(reducer)

Reduce Pub Names in Python
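Between the mapper and the reducer, Hadoop sorts the mapper output and groups it by key. The sketch below imitates that step locally so the two functions can be tried without a cluster. Assumptions: `run_locally` is a hypothetical harness, not part of pymongo_hadoop; the mapper here omits the geographic filter because its helpers are not shown on the slides; the input names are made up.

```python
from itertools import groupby
from operator import itemgetter


def mapper(documents):
    # Simplified mapper: no geographic filter, one count per document.
    for doc in documents:
        yield {"_id": doc["name"], "count": 1}


def reducer(key, values):
    # Same logic as the reducer on the slide.
    _count = 0
    for v in values:
        _count += v["count"]
    return {"_id": key, "value": _count}


def run_locally(documents):
    """Imitate the shuffle step: sort mapper output by key, group, reduce."""
    mapped = sorted(mapper(documents), key=itemgetter("_id"))
    return [
        reducer(key, list(group))
        for key, group in groupby(mapped, key=itemgetter("_id"))
    ]


documents = [
    {"name": "All Bar One"},
    {"name": "The Crown"},
    {"name": "All Bar One"},
]

results = run_locally(documents)
print(results)
```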

Page 58: Data Processing and Aggregation with MongoDB

hadoop jar target/mongo-hadoop-streaming-assembly-1.1.0-rc0.jar \

-mapper examples/pub/map.py \

-reducer examples/pub/reduce.py \

-mongo mongodb://127.0.0.1/demo.pubs \

-outputURI mongodb://127.0.0.1/demo.pub_names

Execute MapReduce

Page 59: Data Processing and Aggregation with MongoDB

> db.pub_names.find().sort({value: -1}).limit(10)

{ "_id" : "All Bar One", "value" : 11 }
{ "_id" : "The Slug & Lettuce", "value" : 7 }
{ "_id" : "The Coach & Horses", "value" : 6 }
{ "_id" : "The Kings Arms", "value" : 5 }
{ "_id" : "Corney & Barrow", "value" : 4 }
{ "_id" : "O'Neills", "value" : 4 }
{ "_id" : "Pitcher & Piano", "value" : 4 }
{ "_id" : "The Crown", "value" : 4 }
{ "_id" : "The George", "value" : 4 }
{ "_id" : "The Green Man", "value" : 4 }

Popular Pub Names Nearby

Page 60: Data Processing and Aggregation with MongoDB

MongoDB and Hadoop

•  Away from data store

•  Can leverage existing data processing infrastructure

•  Can horizontally scale your data processing

− Offline batch processing

− Requires synchronisation between store & processor

− Infrastructure is much more complex

Page 61: Data Processing and Aggregation with MongoDB

The Future of Big Data and MongoDB

Page 62: Data Processing and Aggregation with MongoDB

What is Big Data? Big Data today will be normal tomorrow

Page 63: Data Processing and Aggregation with MongoDB

Exponential Data Growth

Chart: billions of URLs indexed by Google, 2000 to 2012 (vertical axis from 0 to 10000)

Page 64: Data Processing and Aggregation with MongoDB

MongoDB enables you to scale big

Page 65: Data Processing and Aggregation with MongoDB

MongoDB is evolving so you can process the big

Page 66: Data Processing and Aggregation with MongoDB

Data Processing with MongoDB

•  Process in MongoDB using Map/Reduce

•  Process in MongoDB using Aggregation Framework

•  Process outside MongoDB using Hadoop and other external tools

Page 67: Data Processing and Aggregation with MongoDB

MongoDB Integration

•  Hadoop https://github.com/mongodb/mongo-hadoop

•  Storm https://github.com/christkv/mongo-storm

•  Disco https://github.com/mongodb/mongo-disco

•  Spark Coming soon!

Page 68: Data Processing and Aggregation with MongoDB

Questions?

Page 69: Data Processing and Aggregation with MongoDB

Thanks!

[email protected]

Massimo Brignoli

@massimobrignoli