Building Social Analytics Tool with MongoDB - A Developer's Perspective
1. Product Overview
2. Why MongoDB for us?
3. Aggregation Queries to the rescue
4. How Javascript helped us?
5. Experiences with Indexes
6. In-progress use-cases
7. Tips & Tricks
8. Demo
Agenda
Abhishek Tejpaul, Software Developer @ IntelliGrape Software. Loves Grails, Git and Linux.
About me
[Diagram: DataSift, Web Crawler1, Web Crawler…, MongoDB]
Product Overview – Information Flow
Product Overview – Results
• Schema-less data. Typical data sources
• Adding new social platforms in future
• Needed fast read-write operations
Why MongoDB for us?
Aggregation Queries – Getting Insights
• Combination of queries chained together
• At every stage, we can filter/chain/massage data
Image credit: https://www.openshift.com/blogs/an-overview-of-whats-new-in-mongodb-22
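To make the idea of chained stages concrete, here is a minimal plain-JavaScript analogy that runs outside MongoDB. The sample documents and the field names `source` and `sentiment` are assumptions made up for illustration; they are not taken from the product's schema.

```javascript
// Hypothetical sample documents; field names are assumptions for illustration.
const articles = [
  { source: "twitter", sentiment: "positive" },
  { source: "twitter", sentiment: "negative" },
  { source: "blog", sentiment: "positive" },
  { source: "twitter", sentiment: "positive" },
];

// Stage 1 (similar to $match): keep only the documents of interest.
const matched = articles.filter((doc) => doc.source === "twitter");

// Stage 2 (similar to $group with $sum: 1): count documents per sentiment.
const counts = new Map();
for (const doc of matched) {
  counts.set(doc.sentiment, (counts.get(doc.sentiment) || 0) + 1);
}

// Stage 3 (similar to $sort, descending) and stage 4 (similar to $limit).
const topSentiments = [...counts.entries()]
  .map(([sentiment, sum]) => ({ _id: sentiment, sum }))
  .sort((a, b) => b.sum - a.sum)
  .slice(0, 20);

console.log(topSentiments);
```

Each step only sees the output of the previous one, which is why filtering first keeps the later steps cheap.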
Our use-case (esp. for graphs)
• Sentiment Analysis
• Demographic Analysis
• Article Analysis
• Plan
  • Creation of Intelligence tables in advance
• Reality
  • On-the-fly analysis using Aggregation queries
How to go about it?
• Operates on a single collection
• Think about data you have and insights you want
• Focus on reducing data size early on
  • $match
  • $project
  • $sort
  • $limit, $skip
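• Example
db.collectionName.aggregate([
    // keep only the documents of interest as early as possible
    { "$match" : { fieldName : matchingValue } },
    // keep or compute only the fields needed later
    { "$project" : { oldOrNewField : fieldValue } },
    // $group needs an _id expression to group by
    { "$group" : { "_id" : "$oldOrNewField", "sum" : { "$sum" : 1 } } },
    { "$sort" : { "sum" : -1 } },
    { "$limit" : 20 }
])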
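As a sketch of how such a pipeline might be assembled in application code, the helper below builds a "top N values" pipeline as a plain array. The function name `buildTopValuesPipeline` and the `sentiment` field are hypothetical; `ProcessedArticle` and `isActive` are borrowed from the stop-word example later in the deck.

```javascript
// Builds a "top N values of a field" pipeline. The helper and the
// "sentiment" field are hypothetical, used only for illustration.
function buildTopValuesPipeline(matchField, matchValue, groupField, limit) {
  return [
    // Reduce the data set first so later stages handle fewer documents.
    { $match: { [matchField]: matchValue } },
    // Carry forward only the field that is grouped on.
    { $project: { [groupField]: 1 } },
    // Count documents per distinct value of groupField.
    { $group: { _id: "$" + groupField, sum: { $sum: 1 } } },
    // Largest counts first, then cap the result size.
    { $sort: { sum: -1 } },
    { $limit: limit },
  ];
}

const pipeline = buildTopValuesPipeline("isActive", true, "sentiment", 20);
console.log(JSON.stringify(pipeline, null, 2));
// In the mongo shell this array could be passed as:
//   db.ProcessedArticle.aggregate(pipeline)
```

Keeping the pipeline as data makes it easy to log, unit test, or vary per graph without touching the database.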
Javascript Capabilities
• All the programming capabilities of the JavaScript language at your disposal
• Taking business logic / processing to your data-store
Javascript – Our use-cases
• Remove garbage data at DB level
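  • Wrong results from Twitter
  • Filtering out STOP keywords
// Pull every stop word from the topics of the not-yet-processed articles
db.IgnoreList.findOne().stopWords.forEach(function(data) {
    db.ProcessedArticle.update(
        { "isActive" : true, "isIgnored" : {"$ne" : true} },
        { "$pull" : {"topicOfDiscussion" : {"name" : data}} },
        { "multi" : true }
    );
});
// Flag the articles only after all stop words are pulled; setting the flag
// inside the loop would make every later iteration match nothing
db.ProcessedArticle.update(
    { "isActive" : true, "isIgnored" : {"$ne" : true} },
    { "$set" : {"isIgnored" : true} },
    { "multi" : true }
);
return true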
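The effect of the `$pull` and `$set` operators on a single document can be sketched in plain JavaScript. This is only an in-memory illustration, not how MongoDB implements the update; the helper name `removeStopWords` and the sample article are hypothetical.

```javascript
// Mimics, for one document, what {"$pull": {"topicOfDiscussion": {"name": word}}}
// does for each stop word, followed by {"$set": {"isIgnored": true}}.
// The helper name and sample data are hypothetical.
function removeStopWords(article, stopWords) {
  const blocked = new Set(stopWords);
  return {
    ...article,
    // Drop every topic whose name is a stop word.
    topicOfDiscussion: article.topicOfDiscussion.filter(
      (topic) => !blocked.has(topic.name)
    ),
    // Mark the article as processed.
    isIgnored: true,
  };
}

const article = {
  isActive: true,
  topicOfDiscussion: [{ name: "the" }, { name: "mongodb" }, { name: "and" }],
};
const cleaned = removeStopWords(article, ["the", "and"]);
console.log(cleaned.topicOfDiscussion.map((topic) => topic.name));
```

Only the `mongodb` topic survives, and the original `article` object is left unchanged because the helper returns a copy.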
Javascript – Caveats
• Takes up read-write locks on the entire database
  • Can be run with the { nolock : true } option
db.runCommand({
    eval: <function>,
    args: <args>,
    nolock: <true/false>
})
• Can be replaced by mapReduce in most cases
  • Treat it as a one-off case
Indexes – Our use-cases
• dropDups {dropDups : true}
• background {background : true}
• Time to Live
{expireAfterSeconds : 3600}
• Compound Indexing
{key1 : 1, key2 : 1} != {key1 : 1}
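The options above could be combined with index keys roughly as follows. This is a sketch under assumptions: the field names `url` and `createdAt` are hypothetical, `dropDups` is shown together with `unique : true` because it only applies to unique indexes (and was removed in later MongoDB releases), and a TTL index is expected to sit on a date field.

```javascript
// Each entry pairs an index key spec with its options, in the shape
// accepted by db.collectionName.ensureIndex(keys, options).
// Field names (url, createdAt, key1, key2) are hypothetical.
const indexDefinitions = [
  // Unique index that drops duplicate documents while building.
  { keys: { url: 1 }, options: { unique: true, dropDups: true } },
  // Built in the background so the database stays available.
  { keys: { key1: 1 }, options: { background: true } },
  // TTL index: documents expire about an hour after their createdAt date.
  { keys: { createdAt: 1 }, options: { expireAfterSeconds: 3600 } },
  // Compound index: a different index from the single-field { key1: 1 }.
  { keys: { key1: 1, key2: 1 }, options: {} },
];

// Renders the mongo shell command for one definition.
function toShellCommand(collectionName, def) {
  return `db.${collectionName}.ensureIndex(${JSON.stringify(def.keys)}, ${JSON.stringify(def.options)})`;
}

const commands = indexDefinitions.map((def) => toShellCommand("collectionName", def));
commands.forEach((command) => console.log(command));
```

Printing the commands instead of running them keeps the sketch independent of a live database; each line can be pasted into the shell.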
Our current state
• Faster write operations
  • Under high data load from different sources
• Faster read operations
  • Graph rendering up to 10x quicker
• Ease of scalability
  • Though yet to reach there
Work In Progress
• Full-text search implementation
  • Can be created only on strings or arrays of strings
  • db.collectionName.ensureIndex( { fieldName : "text" } )
• Capped Collections
  • Widgets for last-run jobs / event log tables
  • Very fast writes possible
  • db.createCollection("cName", { capped : true, size : 5242880, max : 5000 } )
  • size argument is always required
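The behaviour that makes capped collections a fit for "last-run jobs" widgets can be modelled with a small in-memory class. This sketch only models the `max` document limit; a real capped collection also evicts by the `size` limit in bytes. The class name `CappedLog` and the sample job documents are hypothetical.

```javascript
// In-memory model of a capped collection limited by document count.
// Only the "max" option is modelled; the byte-based "size" limit is not.
class CappedLog {
  constructor(max) {
    this.max = max;
    this.docs = [];
  }

  // Insert in arrival order; once full, the oldest document is evicted.
  insert(doc) {
    this.docs.push(doc);
    if (this.docs.length > this.max) {
      this.docs.shift();
    }
  }

  // Most recent n documents, newest first (what a widget would show).
  latest(n) {
    return this.docs.slice(-n).reverse();
  }
}

// The size used on the slide is 5 MiB expressed in bytes.
const sizeInBytes = 5 * 1024 * 1024;

const log = new CappedLog(3);
for (let run = 1; run <= 5; run++) {
  log.insert({ job: "crawler", run });
}
console.log(sizeInBytes, log.docs.map((doc) => doc.run));
```

After five inserts into a log capped at three documents, only runs 3, 4 and 5 remain, without any explicit cleanup job.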
Tips / Tricks – Things we learnt
• cloneCollection
  • No more ssh/scp to remote systems
  • db.runCommand({cloneCollection: <nsCollection>, from: <remote>, query: {}})
  • db.cloneCollection(from, collectionName, query)
• db.collectionName.copyTo
  • does not copy indexes
Tips / Tricks – Things we learnt
• remove() vs drop()
  • Can't use remove() for capped collections
  • remove() keeps indexes while drop() clears them
  • To remove all the documents in a collection, use drop()
  • To remove the better part of a large collection, use JavaScript
• pretty() find by default
  • DBQuery.prototype._prettyShell = true (inside your ~/.mongorc.js)
DEMO
I am not a MongoDB expert though :)
Thank You!!