Building a Directed Graph with MongoDB
DESCRIPTION
Details of how Wordnik built a directed graph on top of MongoDB. This is the presentation given during MongoSF 2011 by Tony Tam.
TRANSCRIPT
BUILDING A DIRECTED GRAPH WITH MONGODB
MongoSF 5/24/2011
By Tony Tam @fehguy
WHO IS WORDNIK
Word + Meaning Discovery Engine
Clustered Application built with:
  Scala/Java/Jetty
Only way in is via REST
  19M API calls/day @ 7ms/query average
Physical servers
  72GB RAM, 8 core
  4.3TB DAS
We’re MongoDB users for ~1.5 yrs
  Used in master/slave
  14B documents in MongoDB
WHY A GRAPH FOR WORDS
Technique to model network relationships
  Properties are dynamic
  Links are “arbitrary”
Runtime performance
  Answers in < 5ms/request
Routing functions based on goals
  “find most likely word for X”
  “find more common form of Y”
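A goal such as “find most likely word for X” can be read as: follow one edge type from X and keep the target with the highest weight. The sketch below illustrates that reading against an in-memory map rather than MongoDB. The edge type name `misspelling-of`, the field names `to` and `weight`, and the sample weights are assumptions made for illustration; only the `name|type` key layout follows the `_id` examples later in the talk.

```javascript
// Hypothetical in-memory stand-in for edge documents; field names are assumptions.
const edges = new Map([
  ["teh|word|misspelling-of", [
    { to: "the|word", weight: 0.97 },
    { to: "tea|word", weight: 0.02 },
  ]],
]);

// "Find most likely word for X": follow one edge type and keep the heaviest target.
function mostLikely(edgeStore, nodeId, edgeType) {
  const targets = edgeStore.get(`${nodeId}|${edgeType}`) || [];
  let best = null;
  for (const t of targets) {
    if (best === null || t.weight > best.weight) best = t;
  }
  return best ? best.to : null;
}

console.log(mostLikely(edges, "teh|word", "misspelling-of")); // the|word
```

A node with no edge of the requested type returns `null`, which leaves the caller free to fall back to the original word.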
WHY A GRAPH FOR WORDS
Misspellings, abbreviations, texting, Twitter
MORE ABOUT GRAPHS
Different types of Graphs
  Decisions have huge impact on design + implementation
Nodes (vertices)
  String and numeric properties
Edges (links)
  Finite set of labeled edge types (~30)
  Multiple target nodes per edge
    Each potentially different weight
  Directed, non-symmetrical
WHY BUILD ON MONGODB?
Word Graph is core to Wordnik
Many ways to build a graph
  Dedicated graph DBs
  Relational DBs
MongoDB Document Storage
  Uber-flexible
  Successfully routes in < 5ms
  Long runway for scale-out
  Limit storage infrastructure components
  Easy to implement
WORDNIK GRAPH DATA MODEL
Nodes
  _id field holds name, object type
    Index at no extra cost
  Arbitrary number of properties
    Only two datatypes for us, String, Double
    Node type info in node ID (_id)
      na_corpusCount => Double
      sa_source => String
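To make the node layout concrete, here is a minimal sketch of how such a document might be assembled and checked in application code. The `name|type` layout of `_id` follows the query examples in the talk. Treating the `na_` prefix as "numeric" and `sa_` as "string" is inferred from the two property names on the slide and may not reflect the full convention; `makeNode` and the sample values are hypothetical.

```javascript
// Build a node document in the shape the slide describes.
// Assumption: "na_" marks a numeric (Double) property, "sa_" a String property.
function makeNode(name, type, props) {
  const doc = { _id: `${name}|${type}` };
  for (const [key, value] of Object.entries(props)) {
    const expected = key.startsWith("na_") ? "number"
                   : key.startsWith("sa_") ? "string"
                   : null;
    if (expected === null) throw new Error(`unknown property prefix: ${key}`);
    if (typeof value !== expected) throw new Error(`${key} must be a ${expected}`);
    doc[key] = value;
  }
  return doc;
}

// Sample values are made up for illustration.
const cat = makeNode("cat", "word", { na_corpusCount: 12345, sa_source: "corpus" });
console.log(cat);
```

Because the name and type live in `_id`, the default `_id` index already covers lookups by word and type, which is the "index at no extra cost" point on the slide.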
WORDNIK GRAPH DATA MODEL
Edges
  Destination(s)
  Weight
  Link Properties
  Stored in Mongo Arrays
    Array size is app limited
    Use $push, $pop
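The edge layout can be sketched as one document per source node and edge type, with destinations held in an array that the application keeps bounded. Everything named here is an assumption for illustration: the `targets` field, the `synonym` edge type, the sample weights, and the drop-the-oldest policy. The sketch only builds the update documents; it does not talk to a database.

```javascript
// Hypothetical edge document: one document per (source node, edge type),
// with the destinations kept in an array. Field names are illustrative.
const edge = {
  _id: "cat|word|synonym",
  targets: [
    { to: "feline|word", weight: 0.8 },
    { to: "kitty|word", weight: 0.6 },
  ],
};

// Return the update documents the application would send to add a target
// while keeping the array at or below maxTargets. $push and $pop on the same
// field cannot be combined in one update, so they are separate operations.
function addTargetOps(edgeDoc, target, maxTargets) {
  const ops = [];
  if (edgeDoc.targets.length >= maxTargets) {
    // Assumed policy: drop the oldest entry. $pop with -1 removes the first element.
    ops.push({ $pop: { targets: -1 } });
  }
  ops.push({ $push: { targets: target } });
  return ops;
}

console.log(JSON.stringify(addTargetOps(edge, { to: "tabby|word", weight: 0.4 }, 2)));
```

Each returned document would be passed as the update argument of a separate `db.edge.update({_id: ...}, op)` call.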
ACCESS TO MONGO
Mongo Access via DAO layer
  Limit queries to ones that work “well”
  ALL queries use index
Find Node “cat” of type “word”:
  db.node.findOne({_id:"cat|word"})
Find Edge types for above:
  db.edge.find({_id:/^cat\|word\|/},{_id:1})
Serialization/deserialization
  Done “the old-fashioned way”
  BasicDBObject, BasicDBList faster than mappers for our use case
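A DAO layer of this kind would typically build the `_id` key and the anchored prefix pattern in one place, so that no caller can issue an unanchored regex by accident. The following is a sketch of that idea, not Wordnik's code; `escapeRegex`, `nodeId` and `edgePrefix` are hypothetical helper names, and the sketch does not handle names that themselves contain `|`.

```javascript
// Escape regex metacharacters so names containing "|", ".", etc. match literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function nodeId(name, type) {
  return `${name}|${type}`;
}

// Anchored, case-sensitive "starts with" pattern over edge _ids.
function edgePrefix(name, type) {
  return new RegExp("^" + escapeRegex(nodeId(name, type) + "|"));
}

// Query and projection documents as they would be passed to findOne / find.
const findNode = { _id: nodeId("cat", "word") };
const findEdgeTypes = [{ _id: edgePrefix("cat", "word") }, { _id: 1 }];

console.log(findNode._id);                // cat|word
console.log(findEdgeTypes[0]._id.source); // ^cat\|word\|
```

The generated pattern is the same one shown in the shell example above, so the two queries on the slide fall out of these helpers directly.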
QUERY EFFICIENCY
Max execution time is f(a^hops)
ROUTING, TRAVERSALS, FUNCTIONS
Typically find path from A to B
  Routes have costs
    Low cost or high probability
Our use case is atypical
  LinkedIn vs. Maps
  Not from A to B
    More like “from A with 3 hops”
  This matters!
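The “from A with 3 hops” style of query can be sketched as a breadth-first expansion that stops at a fixed depth instead of searching for a destination. This is an illustration over an in-memory map with made-up node names and weights, not the routing code described in the talk.

```javascript
// In-memory stand-in for the edge collection: source node -> outgoing targets.
// Node names and weights are made up for illustration.
const graph = new Map([
  ["a", [{ to: "b", weight: 0.9 }, { to: "c", weight: 0.5 }]],
  ["b", [{ to: "d", weight: 0.7 }]],
  ["c", [{ to: "d", weight: 0.4 }]],
  ["d", [{ to: "e", weight: 0.8 }]],
  ["e", [{ to: "f", weight: 0.6 }]],
]);

// Breadth-first expansion from start, stopping after maxHops levels.
// Returns each reachable node with the fewest hops needed to reach it.
function withinHops(g, start, maxHops) {
  const hops = new Map([[start, 0]]);
  let frontier = [start];
  for (let depth = 1; depth <= maxHops && frontier.length > 0; depth++) {
    const next = [];
    for (const node of frontier) {
      for (const { to } of g.get(node) || []) {
        if (!hops.has(to)) {
          hops.set(to, depth);
          next.push(to);
        }
      }
    }
    frontier = next;
  }
  hops.delete(start);
  return hops;
}

console.log([...withinHops(graph, "a", 3)]);
```

With three hops from `a`, the sketch reaches `b`, `c`, `d` and `e` but not `f`. Against a real store, each level of the loop could plausibly be served by one batched edge fetch, so the hop limit also bounds the number of round trips.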
PERFORMANCE + SCALING
Query by index only
  Use regex syntax in restricted fashion
    Starts with only
    No look behind
    Case sensitive
  Boring? Fast?
Sharding is a no-brainer
  What about ObjectId()?
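One way to see why a “starts with”, case-sensitive pattern stays fast: a literal prefix is equivalent to a contiguous range of keys, which an index can scan directly. The sketch below shows that equivalence on a sorted array standing in for the `_id` index. It assumes a non-empty ASCII prefix; `prefixRange` and the sample keys are hypothetical.

```javascript
// Convert a literal prefix into the half-open key range [prefix, upper).
// Assumes the prefix is non-empty ASCII, so that bumping the last character
// gives the smallest string that sorts after every key with that prefix.
function prefixRange(prefix) {
  const last = prefix.charCodeAt(prefix.length - 1);
  const upper = prefix.slice(0, -1) + String.fromCharCode(last + 1);
  return { $gte: prefix, $lt: upper };
}

// Sorted keys standing in for the _id index (edge type names are made up).
const keys = [
  "cat|noun",
  "cat|word",
  "cat|word|hypernym",
  "cat|word|synonym",
  "cat|wordy|x",
  "cats|word|synonym",
].sort();

const range = prefixRange("cat|word|");
const hits = keys.filter((k) => k >= range.$gte && k < range.$lt);
console.log(range, hits);
```

A case-insensitive or unanchored pattern has no such single range, which is one reason to keep the regex syntax restricted.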
PERFORMANCE + SCALING
Horizontal? Vertical? Both? And when?
Separate collections by edge type/object type
  Increases storage needs
    Collections all have padding, 30 collections => ~30x padding
Sharding
  Use slick, built-in Mongo sharding
  Roll your own based on your data
What does Wordnik do?
  Neither! (yet)
  30M Nodes, 50M Edges
  One collection for nodes
  One collection for edges
PERFORMANCE + SCALING
Selecting a shard key
  Done in application logic based on OUR data
  Depends on what you need
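For the "roll your own" option, shard selection in application logic might look like the sketch below. This is purely illustrative: the talk says Wordnik had not sharded yet, and the hash function, the shard count, and the goal of co-locating a node with its edges are all assumptions.

```javascript
// FNV-1a-style 32-bit hash over UTF-16 code units; any stable hash would do here.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Route on the "name|type" part of the key so a node and all of its edge
// documents map to the same shard (an assumed goal, not stated in the talk).
function shardFor(id, shardCount) {
  const nodeKey = id.split("|").slice(0, 2).join("|");
  return fnv1a(nodeKey) % shardCount;
}

console.log(shardFor("cat|word", 4), shardFor("cat|word|synonym", 4));
```

Hashing only the node part of the key means a traversal step that reads a node and then its edges stays on one shard, at the cost of hot spots if a few words carry most of the traffic.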
END RESULT
Solves Wordnik Graph infrastructure needs
  Store Word nodes with UGC, corpus, structured, analytical data
  Batch fetch Edges @ > 50k/second
  Find Edge + endpoints in 80mS
Powers our…
  Word Selection
  Canonicalization
  Misspelling
  “Did you mean” logic
  Classification + Matching Engine
EXAMPLES
Misspellings
Abbreviations
Lemmatization
EXAMPLES
Term normalization
  Find similar words
Meaning normalization
  Find “more common” form
EXAMPLES
Applied Word Graph
Recall:
  “Computers are stupid”
  English is complex
Clustering + classification algorithms:
  Stink without consistent data
    “The” => “the” (duh)
    “geese” => “goose” (ok)
  Stink when they’re slow
Graph + Clustering/Classification
  Just add data
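The point about consistent data can be illustrated with a small sketch: normalize each token through the graph before counting, so that variants fall into the same bucket. The map below stands in for canonical-form edges and uses only the two examples from the slide; `normalize` and `countTerms` are hypothetical names, and a real lookup would go through the edge collection.

```javascript
// Hypothetical "canonical form" edges, using the two examples from the slide.
const canonical = new Map([
  ["The", "the"],
  ["geese", "goose"],
]);

// Follow canonical-form links until a word has no further mapping.
// The seen set guards against cycles in the data.
function normalize(word, links) {
  const seen = new Set();
  let current = word;
  while (links.has(current) && !seen.has(current)) {
    seen.add(current);
    current = links.get(current);
  }
  return current;
}

// Count tokens after normalization so variants land in the same bucket.
function countTerms(tokens, links) {
  const counts = new Map();
  for (const token of tokens) {
    const key = normalize(token, links);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

const counts = countTerms(["The", "the", "geese", "goose", "cat"], canonical);
console.log([...counts]);
```

Here "The"/"the" and "geese"/"goose" each collapse to a single term with a count of two, which is the kind of input a clustering or classification step needs.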
MONGODB MAKES A GREAT GRAPH BACK-END
See more about Wordnik APIs:
  http://developer.wordnik.com
Further Reading
  Migrating from MySQL to MongoDB
    http://www.slideshare.net/fehguy/migrating-from-mysql-to-mongodb-at-wordnik
  Maintaining your MongoDB Installation
    http://www.slideshare.net/fehguy/mongo-sv-tony-tam
Source Code
  Mapping Benchmark
    https://github.com/fehguy/mongodb-benchmark-tools
  Wordnik OSS Tools
    https://github.com/wordnik/wordnik-oss
MONGODB MAKES A GREAT GRAPH BACK-END
Questions?