advanced analytics & statistics with mongodb


DESCRIPTION

Big data guru John A. De Goes, CTO of Precog, presents an overview of Quirrel, a high-level, statistically oriented, open source query language designed for advanced analytics and statistics on large-scale JSON data sets. John discusses how the language can be used to solve a variety of common problems encountered by modern application developers, then outlines ongoing efforts to port the language to MongoDB as part of a pure open source distribution.

TRANSCRIPT

Page 1: Advanced Analytics & Statistics with MongoDB

advanced analytics and statistics with mongodb

http://precog.io

John A. De Goes @jdegoes

04/30/2012

mongoDB

Page 2: Advanced Analytics & Statistics with MongoDB

what do you want from your data?

Page 3: Advanced Analytics & Statistics with MongoDB

data storage → data intelligence

I want to get and put data
I want aggregates
I want deep insight

SQL
MongoDB Query Language
MongoDB Aggregation Framework
???

Page 4: Advanced Analytics & Statistics with MongoDB

data storage → data intelligence

I want to get and put data
I want aggregates
I want deep insight

SQL
MongoDB Query Language
MongoDB Aggregation Framework
Map Reduce

Page 5: Advanced Analytics & Statistics with MongoDB

function map() {
    emit(1, // Or put a GROUP BY key here
         {sum: this.value, // the field you want stats for
          min: this.value,
          max: this.value,
          count: 1,
          diff: 0 // M2,n: sum((val-mean)^2)
    });
}

function reduce(key, values) {
    var a = values[0]; // will reduce into here
    for (var i = 1; i < values.length; i++) { // start at 1: values[0] is already in 'a'
        var b = values[i]; // will merge 'b' into 'a'

        // temp helpers
        var delta = a.sum/a.count - b.sum/b.count; // a.mean - b.mean
        var weight = (a.count * b.count)/(a.count + b.count);

        // do the reducing
        a.diff += b.diff + delta*delta*weight;
        a.sum += b.sum;
        a.count += b.count;
        a.min = Math.min(a.min, b.min);
        a.max = Math.max(a.max, b.max);
    }

    return a;
}

function finalize(key, value) {
    value.avg = value.sum / value.count;
    value.variance = value.diff / value.count; // population variance
    value.stddev = Math.sqrt(value.variance);
    return value;
}
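
The delta*delta*weight term in the reduce step is the pairwise update for a running sum of squared deviations, which is what lets partial results be merged in any grouping. The sketch below replays that arithmetic in plain JavaScript, with no MongoDB involved, and compares it with a straightforward two-pass variance. The names toStat, mergeStats, summarize and naiveVariance, and the sample durations, are made up for illustration.

```javascript
// Per-value record, mirroring what map() emits on the slide.
function toStat(v) {
  return { sum: v, min: v, max: v, count: 1, diff: 0 };
}

// Merge two partial summaries using the same arithmetic as reduce().
function mergeStats(a, b) {
  const delta = a.sum / a.count - b.sum / b.count; // a.mean - b.mean
  const weight = (a.count * b.count) / (a.count + b.count);
  return {
    sum: a.sum + b.sum,
    min: Math.min(a.min, b.min),
    max: Math.max(a.max, b.max),
    count: a.count + b.count,
    diff: a.diff + b.diff + delta * delta * weight,
  };
}

// Equivalent of map + reduce + finalize over an in-memory array.
function summarize(values) {
  const s = values.map(toStat).reduce((a, b) => mergeStats(a, b));
  const variance = s.diff / s.count; // population variance, as in finalize()
  return { ...s, avg: s.sum / s.count, variance, stddev: Math.sqrt(variance) };
}

// Two-pass population variance, used only as a cross-check.
function naiveVariance(values) {
  const mean = values.reduce((x, y) => x + y, 0) / values.length;
  return values.reduce((acc, v) => acc + (v - mean) ** 2, 0) / values.length;
}

const sampleDurations = [2, 4, 4, 4, 5, 5, 7, 9]; // invented sample data
const stats = summarize(sampleDurations);
console.log(stats.avg, stats.variance, naiveVariance(sampleDurations));
```

Because mergeStats is a pure function of two summaries, the same result (up to floating-point rounding) comes out whether the values are merged one at a time or in separately reduced chunks, which is the property map/reduce relies on.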

Page 6: Advanced Analytics & Statistics with MongoDB

what if there were another way?

Page 7: Advanced Analytics & Statistics with MongoDB

introducing Quirrel

• Statistical query language for JSON data

• Purely declarative

• Implicitly parallel

• Inherently composable

Page 8: Advanced Analytics & Statistics with MongoDB

a taste of quirrel

pageViews := //pageViews
bound := 1.5 * stdDev(pageViews.duration)
avg := mean(pageViews.duration)
lengthyPageViews :=
  pageViews where pageViews.duration > (avg + bound)
lengthyPageViews.userId

Page 9: Advanced Analytics & Statistics with MongoDB

a taste of quirrel

pageViews := //pageViews
bound := 1.5 * stdDev(pageViews.duration)
avg := mean(pageViews.duration)
lengthyPageViews :=
  pageViews where pageViews.duration > (avg + bound)
lengthyPageViews.userId

Users who spend an unusually long time looking at a page!
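
To make the query's meaning concrete, here is a rough plain-JavaScript equivalent over an in-memory array. It is a sketch of the semantics only, not of how Quirrel evaluates anything. The sample page views are invented, and stdDev is assumed to be the population standard deviation, which the slides do not specify.

```javascript
// Invented sample data standing in for //pageViews.
const pageViews = [
  { userId: 7, duration: 10 },
  { userId: 8, duration: 12 },
  { userId: 9, duration: 11 },
  { userId: 7, duration: 9 },
  { userId: 8, duration: 13 },
  { userId: 42, duration: 95 },
];

function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Assumption: population standard deviation (divide by n).
function stdDev(xs) {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
}

const durations = pageViews.map((pv) => pv.duration);
const bound = 1.5 * stdDev(durations);
const avg = mean(durations);

// pageViews where pageViews.duration > (avg + bound)
const lengthyPageViews = pageViews.filter((pv) => pv.duration > avg + bound);

// lengthyPageViews.userId
const lengthyUserIds = lengthyPageViews.map((pv) => pv.userId);
console.log(lengthyUserIds);
```

The five Quirrel lines map one-to-one onto the last five statements; everything above them is plumbing that Quirrel supplies as built-ins.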

Page 10: Advanced Analytics & Statistics with MongoDB

quirrel in 10 minutes


Page 11: Advanced Analytics & Statistics with MongoDB

set-oriented

in Quirrel everything is a set of events

Page 12: Advanced Analytics & Statistics with MongoDB

event

an event is a JSON value paired with an identity

Page 13: Advanced Analytics & Statistics with MongoDB

(really) basic queries

quirrel> 1
[1]

quirrel> true
[true]

quirrel> {userId: 1239823, name: "John Doe"}
[{userId: 1239823, name: "John Doe"}]

quirrel> 1 + 2
[3]

quirrel> (sqrt(16) * 4 - 1) / 3
[5]

Page 14: Advanced Analytics & Statistics with MongoDB

loading data

quirrel> //payments
[{"amount":5,"date":1329741127233,"recipients":["research","marketing"]}, ...]

quirrel> load("/payments")
[{"amount":5,"date":1329741127233,"recipients":["research","marketing"]}, ...]

Page 15: Advanced Analytics & Statistics with MongoDB

variables

quirrel> payments := //payments
       | payments
[{"amount":5,"date":1329741127233,"recipients":["research","marketing"]}, ...]

quirrel> five := 5
       | five * 2
[10]

Page 16: Advanced Analytics & Statistics with MongoDB

filtered descent

quirrel> //users.userId
[9823461231, 916727123, 23987183, ...]

quirrel> //payments.recipients[0]
["engineering","operations","research", ...]

Page 17: Advanced Analytics & Statistics with MongoDB

reductions

quirrel> count(//users)
24185132

quirrel> mean(//payments.amount)
87.39

quirrel> sum(//payments.amount)
921541.29

quirrel> stdDev(//payments.amount)
31.84

Page 18: Advanced Analytics & Statistics with MongoDB

identity matching

[Diagram: events e1 through e12 grouped into sets a and b, combined with the * operator to form a * b]

Page 19: Advanced Analytics & Statistics with MongoDB

identity matching

quirrel> orders := //orders
       | orders.subTotal +
       |   orders.subTotal *
       |   orders.taxRate +
       |   orders.shipping + orders.handling
[153.54805, 152.7618, 80.38365, ...]
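
One way to picture identity matching: every event keeps its identity as it flows through an expression, and a binary operator pairs up events that share an identity. The sketch below models that in plain JavaScript. It is only a mental model under that assumption, not Quirrel's implementation; the event representation, the helper names descend and matchOp, and the sample orders are all hypothetical.

```javascript
// An event modeled as a JSON value paired with an identity (hypothetical shape).
const orders = [
  { id: "e1", value: { subTotal: 100, taxRate: 0.08, shipping: 5, handling: 2 } },
  { id: "e2", value: { subTotal: 50, taxRate: 0.1, shipping: 4, handling: 1 } },
];

// Property descent: narrow each value, keep its identity.
function descend(events, field) {
  return events.map((e) => ({ id: e.id, value: e.value[field] }));
}

// Binary operator: combine events that share an identity.
// In this sketch, events without a partner are simply dropped.
function matchOp(left, right, op) {
  const byId = new Map(right.map((e) => [e.id, e.value]));
  return left
    .filter((e) => byId.has(e.id))
    .map((e) => ({ id: e.id, value: op(e.value, byId.get(e.id)) }));
}

const add = (x, y) => x + y;
const mul = (x, y) => x * y;

// orders.subTotal + orders.subTotal * orders.taxRate
//   + orders.shipping + orders.handling
const subTotal = descend(orders, "subTotal");
const tax = matchOp(subTotal, descend(orders, "taxRate"), mul);
const fees = matchOp(descend(orders, "shipping"), descend(orders, "handling"), add);
const totals = matchOp(matchOp(subTotal, tax, add), fees, add);

console.log(totals.map((e) => e.value));
```

Because every operand descends from the same orders set, each order's fields are only ever combined with fields of that same order, which is why the expression yields one total per order rather than a cross product.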

Page 20: Advanced Analytics & Statistics with MongoDB

values

quirrel> payments.amount * 0.10
[6.1, 27.842, 29.084, 50, 0.5, 16.955, ...]

Page 21: Advanced Analytics & Statistics with MongoDB

filtering

quirrel> users := //users
       | segment := users.age > 19 &
       |   users.age < 53 & users.income > 60000
       | count(users where segment)
[15]

Page 22: Advanced Analytics & Statistics with MongoDB

chaining

pageViews := //pageViews
bound := 1.5 * stdDev(pageViews.duration)
avg := mean(pageViews.duration)
lengthyPageViews :=
  pageViews where pageViews.duration > (avg + bound)
lengthyPageViews.userId

Page 23: Advanced Analytics & Statistics with MongoDB

user functions

quirrel> pageViews := //pageViews
       |
       | statsForUser('userId) :=
       |   {userId: 'userId,
       |    meanPageView: mean(pageViews.duration
       |      where pageViews.userId = 'userId)}
       |
       | statsForUser
[{"userId":12353,"meanPageView":100.66666666666667},{"userId":12359,"meanPageView":83}, ...]
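
Judging from the output shown, calling statsForUser without binding 'userId produces one result per distinct userId, which behaves much like a group-by. A plain-JavaScript sketch of that reading follows. The sample page views are invented (chosen so the means resemble the slide's output), and the implementation is an illustration, not a description of how Quirrel evaluates the function.

```javascript
// Invented sample data standing in for //pageViews.
const pageViews = [
  { userId: 12353, duration: 90 },
  { userId: 12359, duration: 83 },
  { userId: 12353, duration: 100 },
  { userId: 12353, duration: 112 },
];

// One pass: accumulate sum and count per userId, then emit one object per user.
function statsForUser(views) {
  const totals = new Map();
  for (const pv of views) {
    const t = totals.get(pv.userId) || { sum: 0, count: 0 };
    t.sum += pv.duration;
    t.count += 1;
    totals.set(pv.userId, t);
  }
  return [...totals].map(([userId, t]) => ({
    userId,
    meanPageView: t.sum / t.count,
  }));
}

const userStats = statsForUser(pageViews);
console.log(JSON.stringify(userStats));
```

The Quirrel version states only the relationship between a user and that user's mean; the grouping loop and the accumulator bookkeeping are left to the runtime.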

Page 24: Advanced Analytics & Statistics with MongoDB

lots more!

• Cross-joins

• Self-joins

• Augmentation

• Power-packed standard library

Page 25: Advanced Analytics & Statistics with MongoDB

quirrel -> mongodb

• Quirrel is extremely expressive

• Aggregation framework insufficient

• Working with 10gen on new primitives

• Backup plan: AF + MapReduce

Page 26: Advanced Analytics & Statistics with MongoDB

quirrel -> mongodb

pageViews := //pageViews
bound := 1.5 * stdDev(pageViews.duration)
avg := mean(pageViews.duration)
lengthyPageViews :=
  pageViews where pageViews.duration > (avg + bound)
lengthyPageViews.userId

one-pass map/reduce (bound, avg)
one-pass mongo filter (lengthyPageViews)
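
The two-step plan on this slide can be sketched as follows: one pass over the data to get the mean and standard deviation, then an ordinary MongoDB query filter built from the resulting threshold. The slides do not describe how the translation is actually performed, so this is only an illustration. The sample data, the helper names onePassStats and matchesFilter, and the collection name in the comments are assumptions; $gt is a standard MongoDB query operator.

```javascript
// Invented sample data standing in for the pageViews collection.
const pageViews = [
  { userId: 7, duration: 10 },
  { userId: 8, duration: 12 },
  { userId: 9, duration: 11 },
  { userId: 7, duration: 9 },
  { userId: 8, duration: 13 },
  { userId: 42, duration: 95 },
];

// Step 1 (stands in for the one-pass map/reduce): a single pass accumulating
// count, sum and sum of squares. Simpler than the merge formula shown earlier
// in the deck, but less numerically stable on large or skewed data.
function onePassStats(docs, field) {
  let count = 0;
  let sum = 0;
  let sumSq = 0;
  for (const d of docs) {
    count += 1;
    sum += d[field];
    sumSq += d[field] * d[field];
  }
  const avg = sum / count;
  const variance = Math.max(sumSq / count - avg * avg, 0); // population variance
  return { avg, stdDev: Math.sqrt(variance) };
}

const { avg, stdDev } = onePassStats(pageViews, "duration");
const bound = 1.5 * stdDev;

// Step 2 (the one-pass mongo filter): a query document. In the mongo shell it
// could be used as db.pageViews.find(filter, { userId: 1, _id: 0 }),
// assuming a collection named pageViews.
const filter = { duration: { $gt: avg + bound } };

// Tiny local matcher so the sketch runs without a database; handles $gt only.
function matchesFilter(doc, f) {
  return Object.entries(f).every(([field, cond]) => doc[field] > cond.$gt);
}

const lengthyUserIds = pageViews
  .filter((doc) => matchesFilter(doc, filter))
  .map((doc) => doc.userId);
console.log(JSON.stringify(filter), lengthyUserIds);
```

The split works because the statistics are plain reductions that yield constants, so the final where clause collapses into a comparison against a fixed number that the database can evaluate in a single scan.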

Page 27: Advanced Analytics & Statistics with MongoDB

Q&A

John A. De Goes @jdegoes

http://precog.io 04/30/2012