advanced analytics & statistics with mongodb

Post on 01-Dec-2014

Category:

Technology

DESCRIPTION

Big data guru John A. De Goes, CTO of Precog, presents an overview of Quirrel, a high-level, statistically oriented, open source query language designed for advanced analytics and statistics on large-scale JSON data sets. John discusses how the language can be used to solve a variety of common problems encountered by modern application developers, and then reviews ongoing efforts to port the language to MongoDB as part of a pure open source distribution.

TRANSCRIPT

advanced analytics and statistics with mongodb

http://precog.io

John A. De Goes @jdegoes

04/30/2012

mongoDB

what do you want from your data?

I want aggregates

I want to get and put data

I want deep insight

data storage data intelligence

SQL

MongoDB Query Language

MongoDB Aggregation Framework

???

I want aggregates

I want to get and put data

I want deep insight

SQL

MongoDB Query Language

MongoDB Aggregation Framework

Map Reduce

data storage data intelligence

function map() {
    emit(1, // Or put a GROUP BY key here
         {sum: this.value, // the field you want stats for
          min: this.value,
          max: this.value,
          count: 1,
          diff: 0, // M2,n: sum((val-mean)^2)
    });
}

function reduce(key, values) {
    var a = values[0]; // will reduce into here
    for (var i = 1/*!*/; i < values.length; i++) {
        var b = values[i]; // will merge 'b' into 'a'

        // temp helpers
        var delta = a.sum/a.count - b.sum/b.count; // a.mean - b.mean
        var weight = (a.count * b.count)/(a.count + b.count);

        // do the reducing
        a.diff += b.diff + delta*delta*weight;
        a.sum += b.sum;
        a.count += b.count;
        a.min = Math.min(a.min, b.min);
        a.max = Math.max(a.max, b.max);
    }

    return a;
}

function finalize(key, value) {
    value.avg = value.sum / value.count;
    value.variance = value.diff / value.count;
    value.stddev = Math.sqrt(value.variance);
    return value;
}
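The merge step above only works if reducing partial results gives the same answer as reducing everything at once, since MongoDB may call reduce again on its own output. A minimal sketch for checking that outside MongoDB follows. The helper names (mapValue, reduceStats, finalizeStats) and the sample durations are made up for illustration; the arithmetic mirrors the slide.

```javascript
// Local stand-ins for the slide's map/reduce/finalize (hypothetical names, no MongoDB needed).
function mapValue(value) {
  return { sum: value, min: value, max: value, count: 1, diff: 0 };
}

function reduceStats(values) {
  const a = Object.assign({}, values[0]); // copy so the input is not mutated
  for (let i = 1; i < values.length; i++) {
    const b = values[i];
    const delta = a.sum / a.count - b.sum / b.count; // a.mean - b.mean
    const weight = (a.count * b.count) / (a.count + b.count);
    a.diff += b.diff + delta * delta * weight;
    a.sum += b.sum;
    a.count += b.count;
    a.min = Math.min(a.min, b.min);
    a.max = Math.max(a.max, b.max);
  }
  return a;
}

function finalizeStats(value) {
  const avg = value.sum / value.count;
  const variance = value.diff / value.count; // population variance, as on the slide
  return Object.assign({}, value, { avg, variance, stddev: Math.sqrt(variance) });
}

// Made-up sample data.
const durations = [2, 4, 4, 4, 5, 5, 7, 9];
const mapped = durations.map(mapValue);

// Reduce everything in one call...
const allAtOnce = finalizeStats(reduceStats(mapped));

// ...and reduce two chunks separately, then reduce the partial results.
const chunked = finalizeStats(
  reduceStats([reduceStats(mapped.slice(0, 3)), reduceStats(mapped.slice(3))])
);
```

Both allAtOnce and chunked should agree up to floating-point rounding, which is the property the delta and weight terms exist to preserve.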

what if there were another way?

• Statistical query language for JSON data

• Purely declarative

• Implicitly parallel

• Inherently composable

introducing Quirrel

a taste of quirrel

pageViews := //pageViews

bound := 1.5 * stdDev(pageViews.duration)

avg := mean(pageViews.duration)

lengthyPageViews :=  pageViews where pageViews.duration > (avg + bound)

lengthyPageViews.userId

Users who spend an unusually long time looking at a page!
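For readers who think in imperative code, the query above can be read roughly as the following sketch. The in-memory pageViews array and its values are made up, and stdDev is assumed here to be the population standard deviation; the slides do not say which variant Quirrel uses.

```javascript
// Made-up sample data standing in for //pageViews.
const pageViews = [
  { userId: 101, duration: 10 },
  { userId: 102, duration: 12 },
  { userId: 103, duration: 11 },
  { userId: 104, duration: 9 },
  { userId: 105, duration: 13 },
  { userId: 106, duration: 60 },
];

function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Assumption: population standard deviation.
function stdDev(xs) {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
}

const durations = pageViews.map((pv) => pv.duration);
const bound = 1.5 * stdDev(durations);
const avg = mean(durations);

// pageViews where pageViews.duration > (avg + bound)
const lengthyPageViews = pageViews.filter((pv) => pv.duration > avg + bound);

// lengthyPageViews.userId
const lengthyUserIds = lengthyPageViews.map((pv) => pv.userId);
```

With this sample only the 60-second view exceeds avg + bound, so lengthyUserIds contains a single user.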

quirrel in 10 minutes

in Quirrel everything is a set of events

set-oriented

an event is a JSON value paired with an identity

event

quirrel> 1
[1]

quirrel> true
[true]

quirrel> {userId: 1239823, name: "John Doe"}
[{userId: 1239823, name: "John Doe"}]

quirrel> 1 + 2
[3]

quirrel> (sqrt(16) * 4 - 1) / 3
[5]

(really) basic queries

quirrel> //payments

[{"amount":5,"date":1329741127233,"recipients":["research","marketing"]}, ...]

quirrel> load("/payments")

[{"amount":5,"date":1329741127233,"recipients":["research","marketing"]}, ...]

loading data

quirrel> payments := //payments
       | payments

[{"amount":5,"date":1329741127233,"recipients":["research","marketing"]}, ...]

quirrel> five := 5
       | five * 2
[10]

variables

quirrel> //users.userId

[9823461231, 916727123, 23987183, ...]

quirrel> //payments.recipients[0]

["engineering","operations","research", ...]

filtered descent

quirrel> count(//users)
24185132

quirrel> mean(//payments.amount)
87.39

quirrel> sum(//payments.amount)
921541.29

quirrel> stdDev(//payments.amount)
31.84

reductions

identity matching

[Diagram: events e1 through e12 grouped into sets a and b, with the result of a * b]

quirrel> orders := //orders
       | orders.subTotal +
       | orders.subTotal *
       | orders.taxRate +
       | orders.shipping + orders.handling
[153.54805, 152.7618, 80.38365, ...]

identity matching
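One way to picture identity matching is as a join on event identity rather than a cross product: the order total above pairs each subTotal with the taxRate, shipping and handling of the same order. The sketch below is a hypothetical model, not Quirrel's implementation. The { id, value } event shape, the descend and combine helpers, and the sample orders are all assumptions, and unmatched events are simply dropped because the slides do not say how Quirrel treats them.

```javascript
// Hypothetical model: an event is a JSON value paired with an identity.
const orders = [
  { id: "e1", value: { subTotal: 100, taxRate: 0.05, shipping: 5, handling: 2 } },
  { id: "e2", value: { subTotal: 40, taxRate: 0.1, shipping: 3, handling: 1 } },
];

// Descending into a field keeps each event's identity.
function descend(set, field) {
  return set.map((e) => ({ id: e.id, value: e.value[field] }));
}

// A binary operation pairs events that share an identity.
// Assumption: events without a partner are dropped.
function combine(a, b, op) {
  const right = new Map(b.map((e) => [e.id, e.value]));
  return a
    .filter((e) => right.has(e.id))
    .map((e) => ({ id: e.id, value: op(e.value, right.get(e.id)) }));
}

const add = (x, y) => x + y;
const mul = (x, y) => x * y;

// orders.subTotal + orders.subTotal * orders.taxRate + orders.shipping + orders.handling
const subTotal = descend(orders, "subTotal");
const tax = combine(subTotal, descend(orders, "taxRate"), mul);
const totals = combine(
  combine(combine(subTotal, tax, add), descend(orders, "shipping"), add),
  descend(orders, "handling"),
  add
);
```

Two orders in, two totals out: each total is built only from fields of the same underlying event.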

quirrel> payments.amount * 0.10
[6.1, 27.842, 29.084, 50, 0.5, 16.955, ...]

values

quirrel> users := //users
       | segment := users.age > 19 &
       |   users.age < 53 & users.income > 60000
       | count(users where segment)
[15]

filtering
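In the query above, segment is itself a set: one boolean per user, carrying that user's identity. The where clause then keeps the users whose matching boolean is true. A rough sketch of that reading, using a made-up { id, value } event model, a hypothetical where helper, and invented sample users:

```javascript
// Made-up sample data; each event pairs a value with an identity.
const users = [
  { id: "u1", value: { age: 34, income: 72000 } },
  { id: "u2", value: { age: 18, income: 90000 } },
  { id: "u3", value: { age: 45, income: 50000 } },
  { id: "u4", value: { age: 52, income: 61000 } },
];

// segment := users.age > 19 & users.age < 53 & users.income > 60000
// One boolean per user, keyed by the same identity.
const segment = users.map((u) => ({
  id: u.id,
  value: u.value.age > 19 && u.value.age < 53 && u.value.income > 60000,
}));

// Hypothetical helper: keep events whose identity-matched boolean is true.
function where(set, mask) {
  const keep = new Set(mask.filter((m) => m.value === true).map((m) => m.id));
  return set.filter((e) => keep.has(e.id));
}

// count(users where segment)
const segmentCount = where(users, segment).length;
```

Because segment is an ordinary named set, it can be reused in other expressions, which is what makes the filter composable.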

pageViews := //pageViews

bound := 1.5 * stdDev(pageViews.duration)

avg := mean(pageViews.duration)

lengthyPageViews :=  pageViews where pageViews.duration > (avg + bound)

lengthyPageViews.userId

chaining

quirrel> pageViews := //pageViews
       |
       | statsForUser('userId) :=
       |   {userId: 'userId,
       |    meanPageView: mean(pageViews.duration
       |      where pageViews.userId = 'userId)}
       |
       | statsForUser

[{"userId":12353,"meanPageView":100.66666666666667},{"userId":12359,"meanPageView":83}, ...]

user functions
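Judging by the output on the slide, calling statsForUser without an argument yields one result per distinct userId, much like a group-by. A minimal sketch of that behavior, assuming made-up page view data; this illustrates the result shape only and is not how Quirrel evaluates the function.

```javascript
// Made-up sample data standing in for //pageViews.
const pageViews = [
  { userId: 1, duration: 90 },
  { userId: 2, duration: 83 },
  { userId: 1, duration: 110 },
];

// One {userId, meanPageView} object per distinct userId.
function statsForUser(views) {
  const totals = new Map();
  for (const { userId, duration } of views) {
    const t = totals.get(userId) || { sum: 0, count: 0 };
    t.sum += duration;
    t.count += 1;
    totals.set(userId, t);
  }
  return Array.from(totals, ([userId, t]) => ({
    userId,
    meanPageView: t.sum / t.count,
  }));
}

const stats = statsForUser(pageViews);
```

The Quirrel version states only the relationship between userId and the mean; the loop and the accumulator map here are bookkeeping the declarative form leaves to the runtime.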

• Cross-joins

• Self-joins

• Augmentation

• Power-packed standard library

lots more!

• Quirrel is extremely expressive

• Aggregation framework insufficient

• Working with 10gen on new primitives

• Backup plan: AF + MapReduce

quirrel -> mongodb

pageViews := //pageViews

bound := 1.5 * stdDev(pageViews.duration)

avg := mean(pageViews.duration)

lengthyPageViews :=  pageViews where pageViews.duration > (avg + bound)

lengthyPageViews.userId

quirrel -> mongodb

one-pass map/reduce

one-pass mongo filter
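A hedged sketch of how the two passes might fit together: the first pass produces the mean and standard deviation (the shape of the finalize output on the map/reduce slide), and the second pass turns them into an ordinary MongoDB query document. The lengthyPageViewsFilter helper, the pageViews collection name, and the sample statistics are assumptions for illustration; the actual Quirrel-to-MongoDB translation is not shown in the slides.

```javascript
// Pass 2: build a query document from the statistics computed in pass 1.
// `stats` is assumed to look like the finalize() output: { avg, stddev, ... }.
function lengthyPageViewsFilter(stats) {
  const threshold = stats.avg + 1.5 * stats.stddev; // avg + bound
  return { duration: { $gt: threshold } };
}

// Made-up pass 1 result.
const filter = lengthyPageViewsFilter({ avg: 20, stddev: 10 });

// In the mongo shell the filter could then be used as (not run here):
//   db.pageViews.find(filter, { userId: 1, _id: 0 })
```

The point of the split is that each pass reads the collection once: the aggregate cannot be folded into the filter because the threshold depends on every document.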

qa

John A. De Goes @jdegoes

http://precog.io 04/30/2012
