MongoDB + Pig on Hadoop (MongoSV 2012)

Post on 26-Jan-2015

DESCRIPTION

Slides from Mortar co-founder Jeremy Karn's presentation at MongoSV 2012. Learn to process Mongo data with Hadoop—specifically with Apache Pig. Jeremy's presentation covered the steps needed to read JSON from Mongo into Pig, parallel process it on Hadoop with sophisticated functions, and write back to Mongo. This talk will demonstrate its concepts with Mortar, which has contributed to the Mongo Hadoop connector, extending it to work with Pig.

TRANSCRIPT

MongoDB + Pig
Jeremy Karn - co-founder, Mortar

Overview: OF THIS SESSION

• Intro to Hadoop
• Intro to Pig
• Why MongoDB + Pig?
• Demo: load data from MongoDB into Pig
• Demo: processing data with Pig
• Demo: store data from Pig to MongoDB

Hadoop: RAPID OVERVIEW

• Hadoop implements MapReduce (Java)
• Created by Doug Cutting
• Incubated at Yahoo
• Used for indexing, spam detection, and more
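The MapReduce model that Hadoop implements can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle, and reduce steps, not Hadoop's API; every function name below is hypothetical, and real Hadoop distributes each phase across a cluster.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Hypothetical mapper: emit (word, 1) for each word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Hypothetical reducer: add up the counts emitted for one word."""
    return sum(values)


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

Calling word_count(["pig on hadoop", "pig on mongodb"]) counts "pig" and "on" twice and the other words once. The point of the model is that the mapper and reducer only ever see one record or one key at a time, which is what lets the framework spread the work over many machines.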

Hadoop: STRENGTHS

• Scalable
• Open source
• Lots of momentum
• Very broadly applicable

(Example areas pictured: Social Graph, Predict, Detect, Genetics)

Hadoop: PROBLEMS

• Difficult
• Batch only (...or it was)

Hadoop: FUTURE

• YARN
• MapReduce optional
• Generic management + distributed apps
• Impala

Alternatives to Hadoop: MONGODB NATIVE MAPREDUCE

Write MapReduce in JavaScript
• JavaScript is not fast
• Has limited data types
• Hard to use complex analytic libs
• Adds load to data store

Hadoop has libs for
• Machine learning
• ETL

Hadoop can also access any JVM analytic libs, and many organizations already use Hadoop.

Alternatives to Hadoop: MONGODB AGGREGATION FRAMEWORK

Great when
• Doing SQL-style aggregation
• You do not require external data libs
• Users will learn the framework

But you may want Hadoop when
• Doing sophisticated aggregation
• You require external data libs
• Users are unwilling to learn the framework
• You need to transfer workload off the datastore

Pig: ON HADOOP

• Less code
• Expressive code

Pig: BRIEF, EXPRESSIVE, LIKE PROCEDURAL SQL

(thanks: Twitter Hadoop World presentation)

The Same Script, In MapReduce: FOR SERIOUS

Pig: ON HADOOP

• Less code
• Expressive code
• Compiles to MR
• Insulates from API
• Popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford)

MongoDB + Pig: MOTIVATIONS

Data storage and data processing are often separate concerns

Hadoop is built for scalable processing of large datasets

MongoDB, Pig: SIMILAR STANCE

Poly-structured data
• MongoDB: stores data, regardless of structure
• Pig: reads data, regardless of structure (got its name because pigs are omnivorous)

MongoDB, Pig: JSON-PIG DATA TYPE MAPPING

JSON       Pig
string     chararray
integer    int
boolean    boolean
double     double
array      bag
object     map/tuple
null       null
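The JSON-to-Pig mapping above can be written as a small lookup. The sketch below is illustrative only and is not part of the mongo-hadoop connector; the function names are hypothetical, and it reports "map" for JSON objects even though the slide lists map/tuple, since which of the two applies depends on the schema you declare.

```python
import json


def pig_type_for(value):
    """Return the Pig type name the slide's table pairs with a decoded JSON value."""
    # bool must be checked before int because Python's bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "chararray"
    if isinstance(value, list):
        return "bag"
    if isinstance(value, dict):
        return "map"  # assumption: the slide lists map/tuple; map is used here
    if value is None:
        return "null"
    raise TypeError("not a JSON value: %r" % (value,))


def describe(json_text):
    """Map each top-level field of a JSON object to its Pig type name."""
    document = json.loads(json_text)
    return {field: pig_type_for(value) for field, value in document.items()}
```

For example, describe('{"mailbox": "bass-e", "headers": {}}') pairs mailbox with chararray and headers with map.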

MongoDB, Pig: MONGODB-PIG DATA TYPE MAPPING

MongoDB       Pig
date          datetime
object id     chararray
binary data   bytearray
regexp        chararray
code          chararray

Mortar: FAST INTRO

Open-source code-based dev framework for data, built on Hadoop and Pig

Inspired by Rails

Self-contained, organized, executable projects

> gem install mortar

> mortar new my_project

Our service hosts and executes mortar projects

> mortar jobs:run your_pigscript --clustersize 5

Browser-only interface, great for demonstrating Hadoop

MongoDB, Pig: LOADING DATA

One requirement:
• Must specify the top-level fields to load from the MongoDB collection.

Optional:
• Specify a subset of embedded fields
• Data type for any/all fields

MongoDB, Pig: LOADING DATA - ENRON DATA

{
    "body": "the ... person...",
    "subFolder": "notes_inbox",
    "mailbox": "bass-e",
    "filename": "450.",
    "headers": {
        "From": "michael.simmons@enron.com",
        "To": "tim_belden@enron.com",
        "Subject": "Subject",
        "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)"
    }
}
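To make the loading rule concrete, here is a rough sketch of what naming top-level fields, and optionally a subset of embedded fields, means for a document like the Enron sample. It is plain Python operating on a dict, not the connector's loader; the project function and its fields argument are hypothetical.

```python
SAMPLE = {
    "body": "the ... person...",
    "subFolder": "notes_inbox",
    "mailbox": "bass-e",
    "filename": "450.",
    "headers": {
        "From": "michael.simmons@enron.com",
        "To": "tim_belden@enron.com",
        "Subject": "Subject",
        "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    },
}


def project(doc, fields):
    """Keep only the requested fields of a document.

    fields maps each top-level field name to None (take the whole value)
    or to a list of embedded keys to keep from that sub-document.
    Fields missing from the document come back as None.
    """
    result = {}
    for name, embedded_keys in fields.items():
        value = doc.get(name)
        if embedded_keys is not None and isinstance(value, dict):
            value = {key: value.get(key) for key in embedded_keys}
        result[name] = value
    return result
```

Calling project(SAMPLE, {"mailbox": None, "headers": ["From", "To"]}) keeps the mailbox and only the From and To headers, dropping body, subFolder, filename, and the other headers.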

MongoDB, Pig: SCRIPT DEMO

MongoDB, Pig: STORE STATEMENT

The MongoStorage function takes an optional list of arguments of two types:
• A single set of keys to base updating on. This has three options: None, update, or multi.
• Multiple indexes to ensure, in the same format as db.col.ensureIndex().

Pig: ILLUSTRATE

Auto-select dataset

Exercise every execution path

Step-by-step execution

Pig: WHY ILLUSTRATE

Write correct code quickly

Understand others’ code

Test every execution path, every step

Pig: USER-DEFINED FUNCTIONS (UDF)

Pig is like procedural SQL

UDFs for rich data manipulation

UDFs: JVM-based languages (Java, Jython, etc.)

We made Pig work with CPython (NumPy, etc)
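As a rough idea of what a Python UDF looks like, the sketch below extracts the domain from an email address such as the From header in the Enron data. It assumes Pig's pig_util module and its outputSchema decorator are importable in your setup; the fallback decorator and the email_domain function are hypothetical and exist only so the file also runs outside Pig.

```python
try:
    # Assumption: pig_util is on the path when the UDF runs under Pig.
    from pig_util import outputSchema
except ImportError:
    def outputSchema(schema):
        """Stand-in decorator used outside Pig; it only records the schema string."""
        def wrap(func):
            func.output_schema = schema
            return func
        return wrap


@outputSchema("domain:chararray")
def email_domain(address):
    """Return the lower-cased domain of an email address, or None if there is none."""
    if address is None or "@" not in address:
        return None
    return address.rsplit("@", 1)[1].lower()
```

Returning None for missing or malformed input mirrors how Pig treats nulls, so one bad record does not fail the whole job.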

MongoDB + Pig: WITHOUT MORTAR

Get the mongo-hadoop connector: http://github.com/mongodb/mongo-hadoop

MongoDB + Pig: SUMMARY

• Hadoop and friends are maturing
• MongoDB and Pig are philosophically aligned
• Reading MongoDB data into Pig and writing it back is straightforward
• Once in Pig (Hadoop):
  • massive batch calcs / analytics possible
  • work is offloaded
  • external libraries available
