advanced analytics & statistics with mongodb
DESCRIPTION
Big data guru John A. De Goes, CTO of Precog, presents an overview of Quirrel, a high-level, statistically oriented, open source query language designed for advanced analytics and statistics on large-scale JSON data sets. John discusses how the language can be used to solve a variety of common problems encountered by modern application developers, and then gives an overview of ongoing efforts to port the language to MongoDB as part of a pure open source distribution.
TRANSCRIPT
advanced analytics and statistics with mongodb
http://precog.io
John A. De Goes @jdegoes
04/30/2012
mongoDB
what do you want from your data?
I want to get and put data
I want aggregates
I want deep insight
data storage -> data intelligence
SQL
MongoDB Query Language
MongoDB Aggregation Framework
???
I want to get and put data
I want aggregates
I want deep insight
data storage -> data intelligence
SQL
MongoDB Query Language
MongoDB Aggregation Framework
Map Reduce
function map() {
  emit(1, // Or put a GROUP BY key here
    {sum: this.value, // the field you want stats for
     min: this.value,
     max: this.value,
     count: 1,
     diff: 0 // M2,n: sum((val-mean)^2)
    });
}

function reduce(key, values) {
  var a = values[0]; // will reduce into here
  for (var i=1/*!*/; i < values.length; i++){
    var b = values[i]; // will merge 'b' into 'a'

    // temp helpers
    var delta = a.sum/a.count - b.sum/b.count; // a.mean - b.mean
    var weight = (a.count * b.count)/(a.count + b.count);

    // do the reducing
    a.diff += b.diff + delta*delta*weight;
    a.sum += b.sum;
    a.count += b.count;
    a.min = Math.min(a.min, b.min);
    a.max = Math.max(a.max, b.max);
  }
  return a;
}

function finalize(key, value){
  value.avg = value.sum / value.count;
  value.variance = value.diff / value.count;
  value.stddev = Math.sqrt(value.variance);
  return value;
}
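The pairwise merge in reduce can be checked without a MongoDB server. The following is a minimal sketch, assuming plain Node.js; `mapValue`, `reduceStats`, and `finalizeStats` are hypothetical local stand-ins for the three functions above, and the sample durations are made up for illustration.

```javascript
// Local simulation of the map / reduce / finalize steps (no MongoDB required).
// All names and data here are illustrative assumptions, not part of the talk.
function mapValue(value) {
  // Same shape the map() step emits for a single document.
  return { sum: value, min: value, max: value, count: 1, diff: 0 };
}

function reduceStats(values) {
  const a = { ...values[0] }; // copy so the inputs are not mutated
  for (let i = 1; i < values.length; i++) {
    const b = values[i];
    const delta = a.sum / a.count - b.sum / b.count; // a.mean - b.mean
    const weight = (a.count * b.count) / (a.count + b.count);
    a.diff += b.diff + delta * delta * weight; // merge sums of squared deviations
    a.sum += b.sum;
    a.count += b.count;
    a.min = Math.min(a.min, b.min);
    a.max = Math.max(a.max, b.max);
  }
  return a;
}

function finalizeStats(value) {
  const avg = value.sum / value.count;
  const variance = value.diff / value.count; // population variance
  return { ...value, avg, variance, stddev: Math.sqrt(variance) };
}

// Reduce two chunks separately, then merge the partial results,
// as a map/reduce engine is free to do.
const durations = [2, 4, 4, 4, 5, 5, 7, 9];
const left = reduceStats(durations.slice(0, 3).map(mapValue));
const right = reduceStats(durations.slice(3).map(mapValue));
const stats = finalizeStats(reduceStats([left, right]));
console.log(stats);
```

For this sample the merged result should come out at a mean of 5 and a standard deviation of 2, up to floating-point rounding, which matches a direct computation over the whole array.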
what if there were another way?
• Statistical query language for JSON data
• Purely declarative
• Implicitly parallel
• Inherently composable
introducing Quirrel
a taste of quirrel
pageViews := //pageViews
bound := 1.5 * stdDev(pageViews.duration)
avg := mean(pageViews.duration)
lengthyPageViews := pageViews where pageViews.duration > (avg + bound)
lengthyPageViews.userId
Users who spend an unusually long time looking at a page!
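For readers more at home in JavaScript, the query can be read roughly as follows. This is a sketch only: it assumes `stdDev` is a population standard deviation (the slides do not say), and the page view records are invented sample data.

```javascript
// Hypothetical sample data standing in for the //pageViews set.
const pageViews = [
  { userId: 101, duration: 10 },
  { userId: 102, duration: 12 },
  { userId: 103, duration: 11 },
  { userId: 104, duration: 9 },
  { userId: 105, duration: 13 },
  { userId: 106, duration: 60 },
];

function mean(xs) {
  return xs.reduce((acc, x) => acc + x, 0) / xs.length;
}

// Assumption: population standard deviation (divide by n).
function stdDev(xs) {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) * (x - m))));
}

const durations = pageViews.map((v) => v.duration);
const bound = 1.5 * stdDev(durations);
const avg = mean(durations);

// pageViews where pageViews.duration > (avg + bound)
const lengthyPageViews = pageViews.filter((v) => v.duration > avg + bound);
const userIds = lengthyPageViews.map((v) => v.userId);
console.log(userIds);
```

With this data only the 60-unit view clears the threshold, so the result contains a single user.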
quirrel in 10 minutes
in Quirrel everything is a set of events
set-oriented
an event is a JSON value paired with an identity
event
quirrel> 1
[1]
quirrel> true
[true]
quirrel> {userId: 1239823, name: "John Doe"}
[{userId: 1239823, name: "John Doe"}]
quirrel> 1 + 2
[3]
quirrel> sqrt(16) * 4 - 1 / 3
[5]
(really) basic queries
quirrel> //payments
[{"amount":5,"date":1329741127233,"recipients":["research","marketing"]}, ...]
quirrel> load("/payments")
[{"amount":5,"date":1329741127233,"recipients":["research","marketing"]}, ...]
loading data
quirrel> payments := //payments
       | payments
[{"amount":5,"date":1329741127233,"recipients":["research","marketing"]}, ...]
quirrel> five := 5
       | five * 2
[10]
variables
quirrel> //users.userId
[9823461231, 916727123, 23987183, ...]
quirrel> //payments.recipients[0]
["engineering","operations","research", ...]
filtered descent
quirrel> count(//users)
24185132
quirrel> mean(//payments.amount)
87.39
quirrel> sum(//payments.amount)
921541.29
quirrel> stdDev(//payments.amount)
31.84
reductions
identity matching
[Diagram: events e1 to e12 grouped into two sets, a and b, posing the question of what a * b contains]
quirrel> orders := //orders
       | orders.subTotal +
       | orders.subTotal *
       | orders.taxRate +
       | orders.shipping + orders.handling
[153.54805, 152.7618, 80.38365, ...]
identity matching
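One way to picture identity matching is as a join on event identity: a binary operation combines only the values that come from the same event. The sketch below is an illustration under that reading, not Quirrel's implementation; the `id` field, the helper names `descend` and `matchById`, and the order data are all hypothetical.

```javascript
// Each event pairs a JSON value with an identity (here a hypothetical "id").
const orders = [
  { id: "e1", value: { subTotal: 100, taxRate: 0.25, shipping: 5, handling: 2 } },
  { id: "e2", value: { subTotal: 40, taxRate: 0.5, shipping: 3, handling: 1 } },
];

// Descending into a field keeps each event's identity.
function descend(set, field) {
  return set.map((e) => ({ id: e.id, value: e.value[field] }));
}

// A binary operation only combines events that share an identity.
function matchById(a, b, op) {
  const right = new Map(b.map((e) => [e.id, e.value]));
  return a
    .filter((e) => right.has(e.id))
    .map((e) => ({ id: e.id, value: op(e.value, right.get(e.id)) }));
}

const add = (x, y) => x + y;
const mul = (x, y) => x * y;

// orders.subTotal + orders.subTotal * orders.taxRate
//   + orders.shipping + orders.handling
const subTotal = descend(orders, "subTotal");
const tax = matchById(subTotal, descend(orders, "taxRate"), mul);
const fees = matchById(descend(orders, "shipping"), descend(orders, "handling"), add);
const totals = matchById(matchById(subTotal, tax, add), fees, add);
console.log(totals.map((e) => e.value));
```

Each order's fields are combined with that same order's other fields, so the two sample orders total 132 and 64; values from different orders never mix.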
quirrel> payments.amount * 0.10
[6.1, 27.842, 29.084, 50, 0.5, 16.955, ...]
values
quirrel> users := //users
       | segment := users.age > 19 &
       | users.age < 53 & users.income > 60000
       | count(users where segment)
[15]
filtering
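The filtering query can be read as building a set of booleans, one per user event, and then keeping the users whose matching boolean is true. A rough sketch under that interpretation, with a hypothetical `id` field, a hypothetical `where` helper, and invented user records:

```javascript
// Hypothetical user events: a JSON value paired with an identity.
const users = [
  { id: "e1", value: { age: 30, income: 75000 } },
  { id: "e2", value: { age: 17, income: 90000 } },
  { id: "e3", value: { age: 45, income: 40000 } },
  { id: "e4", value: { age: 52, income: 61000 } },
];

// segment := users.age > 19 & users.age < 53 & users.income > 60000
// One boolean per user, carrying the same identity as the user event.
const segment = users.map((e) => ({
  id: e.id,
  value: e.value.age > 19 && e.value.age < 53 && e.value.income > 60000,
}));

// "set where predicate": keep events whose matching predicate value is true.
function where(set, predicate) {
  const keep = new Map(predicate.map((e) => [e.id, e.value]));
  return set.filter((e) => keep.get(e.id) === true);
}

const matched = where(users, segment);
console.log(matched.length);
```

Here two of the four sample users fall inside the segment, so the count is 2.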
pageViews := //pageViews
bound := 1.5 * stdDev(pageViews.duration)
avg := mean(pageViews.duration)
lengthyPageViews := pageViews where pageViews.duration > (avg + bound)
lengthyPageViews.userId
chaining
quirrel> pageViews := //pageViews
       |
       | statsForUser('userId) :=
       |   {userId: 'userId,
       |    meanPageView: mean(pageViews.duration
       |      where pageViews.userId = 'userId)}
       |
       | statsForUser
[{"userId":12353,"meanPageView":100.66666666666667}, {"userId":12359,"meanPageView":83}, ...]
user functions
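Calling `statsForUser` without an argument produces one result per distinct `userId`, which behaves much like a group-by. A minimal JavaScript sketch of the same computation, using invented page views; this shows the result the query describes, not how Quirrel evaluates it.

```javascript
// Hypothetical sample data standing in for the //pageViews set.
const pageViews = [
  { userId: 12353, duration: 90 },
  { userId: 12359, duration: 83 },
  { userId: 12353, duration: 110 },
];

// One output object per distinct userId, with the mean duration for that user.
function statsForUser(views) {
  const totals = new Map(); // userId -> { sum, count }
  for (const { userId, duration } of views) {
    const t = totals.get(userId) || { sum: 0, count: 0 };
    t.sum += duration;
    t.count += 1;
    totals.set(userId, t);
  }
  return [...totals].map(([userId, t]) => ({
    userId,
    meanPageView: t.sum / t.count,
  }));
}

const result = statsForUser(pageViews);
console.log(JSON.stringify(result));
```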
• Cross-joins
• Self-joins
• Augmentation
• Power-packed standard library
lots more!
• Quirrel is extremely expressive
• Aggregation framework insufficient
• Working with 10gen on new primitives
• Backup plan: AF + MapReduce
quirrel -> mongodb
pageViews := //pageViews
bound := 1.5 * stdDev(pageViews.duration)
avg := mean(pageViews.duration)
lengthyPageViews := pageViews where pageViews.duration > (avg + bound)
lengthyPageViews.userId
quirrel -> mongodb
one-pass map/reduce
one-pass mongo filter
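The two annotations suggest a two-step plan: one pass to compute the mean and standard deviation, then one pass that applies an ordinary MongoDB filter. The sketch below illustrates that plan locally and is not the actual Quirrel-to-MongoDB translation. It swaps in Welford's single-pass update for the map/reduce job, assumes a population standard deviation, and uses invented data; the `summarize` helper is hypothetical.

```javascript
// Hypothetical sample documents standing in for the pageViews collection.
const pageViews = [
  { userId: 101, duration: 10 },
  { userId: 102, duration: 12 },
  { userId: 103, duration: 11 },
  { userId: 104, duration: 9 },
  { userId: 105, duration: 13 },
  { userId: 106, duration: 60 },
];

// Pass 1: a single pass yielding mean and standard deviation
// (Welford's update, standing in for the one-pass map/reduce job).
function summarize(docs) {
  let count = 0;
  let mean = 0;
  let m2 = 0; // running sum of squared deviations from the mean
  for (const doc of docs) {
    count += 1;
    const delta = doc.duration - mean;
    mean += delta / count;
    m2 += delta * (doc.duration - mean);
  }
  return { avg: mean, stddev: Math.sqrt(m2 / count) };
}

const { avg, stddev } = summarize(pageViews);

// Pass 2: a filter document in MongoDB query syntax. In the mongo shell it
// could be passed as db.pageViews.find(filter, { userId: 1 }).
const filter = { duration: { $gt: avg + 1.5 * stddev } };

// Local stand-in for the server-side filter; handles only the $gt shape above.
const userIds = pageViews
  .filter((doc) => doc.duration > filter.duration.$gt)
  .map((doc) => doc.userId);
console.log(filter, userIds);
```

The point of the split is that the first pass returns only two numbers, so the second pass can be an ordinary query with a constant threshold.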