Building a Directed Graph with MongoDB

BUILDING A DIRECTED GRAPH WITH MONGODB MongoSF 5/24/2011 By Tony Tam @fehguy

Upload: tony-tam

Post on 11-May-2015


DESCRIPTION

Details of how Wordnik built a directed graph on top of MongoDB. This is the presentation given during MongoSF 2011 by Tony Tam.

TRANSCRIPT

Page 1: Building a Directed Graph with MongoDB

BUILDING A DIRECTED GRAPH WITH MONGODB

MongoSF 5/24/2011

By Tony Tam @fehguy

Page 2: Building a Directed Graph with MongoDB

WHO IS WORDNIK

Word + Meaning Discovery Engine
Clustered Application built with:
  Scala/Java/Jetty
Only way in is via REST
  19M API calls/day @ 7ms/query average
Physical servers
  72GB RAM, 8 core
  4.3TB DAS
We’re MongoDB users for ~1.5 yrs
  Used in master/slave
  14B documents in MongoDB

Page 3: Building a Directed Graph with MongoDB

WHY A GRAPH FOR WORDS

Technique to model network relationships
  Properties are dynamic
  Links are “arbitrary”
Runtime performance
  Answers in < 5ms/request
Routing functions based on goals
  “find most likely word for X”
  “find more common form of Y”

Page 4: Building a Directed Graph with MongoDB

WHY A GRAPH FOR WORDS

Misspellings, abbreviations, texting, Twitter

Page 5: Building a Directed Graph with MongoDB

MORE ABOUT GRAPHS

Different types of Graphs
  Decisions have huge impact on design + implementation
Nodes (vertices)
  String and numeric properties
Edges (links)
  Finite set of labeled edge types (~30)
  Multiple target nodes per edge
    Each potentially different weight
  Directed, non-symmetrical

Page 6: Building a Directed Graph with MongoDB

WHY BUILD ON MONGODB?

Word Graph is core to Wordnik
Many ways to build a graph
  Dedicated graph DBs
  Relational DBs
MongoDB Document Storage
  Uber-flexible
  Successfully routes in < 5ms
  Long runway for scale-out
  Limit storage infrastructure components
  Easy to implement

Page 7: Building a Directed Graph with MongoDB

WORDNIK GRAPH DATA MODEL

Nodes
  _id field holds name, object type
    Index at no extra cost
  Arbitrary number of properties
    Only two datatypes for us: String, Double
  Node type info in node ID (_id)
    na_corpusCount => Double
    sa_source => String
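The node layout above can be sketched with plain JavaScript objects, no database required. This is only an illustration under stated assumptions: the "name|type" shape of _id is taken from the "cat|word" query shown later in the deck, the helper names (nodeId, makeNode, parseNodeId) are hypothetical, reading the na_ and sa_ prefixes as numeric and string properties is an inference from the two examples on the slide, and the property values are made up.

```javascript
// Hypothetical helper: builds the "name|type" _id seen in the deck's
// db.node.findOne({_id:"cat|word"}) example.
function nodeId(name, type) {
  return `${name}|${type}`;
}

// Sketch of a node document: the _id carries name and type, and the
// remaining fields are free-form properties.
function makeNode(name, type, props) {
  return Object.assign({ _id: nodeId(name, type) }, props);
}

// Property names follow the slide; the values are illustrative only.
const catNode = makeNode("cat", "word", {
  na_corpusCount: 12345.0, // assumed: "na_" marks a Double property
  sa_source: "example",    // assumed: "sa_" marks a String property
});

// Recover name and type from the _id, so no extra field or index is needed.
function parseNodeId(id) {
  const i = id.lastIndexOf("|");
  return { name: id.slice(0, i), type: id.slice(i + 1) };
}

console.log(catNode._id, parseNodeId(catNode._id));
```

Because MongoDB always indexes _id, packing the name and type into it gives a keyed lookup without creating a secondary index, which is presumably what "Index at no extra cost" refers to.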

Page 8: Building a Directed Graph with MongoDB

WORDNIK GRAPH DATA MODEL

Edges
  Destination(s)
  Weight
  Link Properties
Stored in Mongo Arrays
  Array size is app limited
  Use $push, $pop
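A minimal sketch of how an edge document and its $push / $pop updates might look. The edge _id shape ("name|type|edgeType") is inferred from the prefix query shown on the next slide; the field names (targets, to, weight), the "synonym" edge type, the MAX_TARGETS cap, and the applyUpdate stand-in are all assumptions made for illustration, not Wordnik's actual schema.

```javascript
// Assumed edge _id shape "<name>|<type>|<edgeType>", inferred from the
// /^cat\|word\|/ prefix query in the deck. "synonym" is a made-up edge type.
function edgeId(name, type, edgeType) {
  return [name, type, edgeType].join("|");
}

// Hypothetical field names: "targets", "to", "weight".
const edge = {
  _id: edgeId("cat", "word", "synonym"),
  targets: [{ to: "feline|word", weight: 0.9 }],
};

// Hypothetical application-side cap on array size ("app limited").
const MAX_TARGETS = 3;

// Build a $push update document, or return null when the cap is reached.
function pushTarget(doc, target) {
  if (doc.targets.length >= MAX_TARGETS) return null;
  return { $push: { targets: target } };
}

// $pop with 1 removes the last array element, -1 removes the first.
const popLast = { $pop: { targets: 1 } };

// In-memory stand-in for what the server does with these two operators.
function applyUpdate(doc, update) {
  const copy = JSON.parse(JSON.stringify(doc));
  for (const [field, value] of Object.entries(update.$push || {})) {
    (copy[field] = copy[field] || []).push(value);
  }
  for (const [field, dir] of Object.entries(update.$pop || {})) {
    if (!Array.isArray(copy[field])) continue;
    if (dir === 1) copy[field].pop();
    else copy[field].shift();
  }
  return copy;
}

const pushed = applyUpdate(edge, pushTarget(edge, { to: "kitty|word", weight: 0.4 }));
const popped = applyUpdate(pushed, popLast);
console.log(pushed.targets.length, popped.targets.length);
```

Against a live server, the same update documents would be passed to an update call such as db.edge.update({_id: edge._id}, update) instead of the in-memory applyUpdate.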

Page 9: Building a Directed Graph with MongoDB

ACCESS TO MONGO

Mongo Access via DAO layer
  Limit queries to ones that work “well”
  ALL queries use index
Find Node “cat” of type “word”:
  db.node.findOne({_id:"cat|word"})
Find Edge types for above:
  db.edge.find({_id:/^cat\|word\|/},{_id:1})
Serialization/deserialization
  Done “the old-fashioned way”
  BasicDBObject, BasicDBList faster than mappers for our use case
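The edge-type lookup above relies on an anchored, case-sensitive prefix regex, which MongoDB can serve from the _id index as a range scan. A sketch of building such a query document safely follows; the function names (escapeRegex, edgePrefixQuery) are hypothetical, and the sample ids exist only to show what matches.

```javascript
// Escape regex metacharacters so the "|" separator is matched literally.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build the query document for "all edges leaving this node".
// The regex is anchored with ^ and has no flags, so it stays a
// case-sensitive "starts with" match.
function edgePrefixQuery(name, type) {
  return { _id: new RegExp("^" + escapeRegex(`${name}|${type}|`)) };
}

const query = edgePrefixQuery("cat", "word");

// Illustrative ids only; "synonym" and "hypernym" are made-up edge types.
const sampleIds = [
  "cat|word|synonym",
  "cat|word|hypernym",
  "cat|wordy|synonym", // different type, must not match
  "Cat|word|synonym",  // different case, must not match
];
const matches = sampleIds.filter((id) => query._id.test(id));
console.log(query._id.source, matches);
```

With the mongo shell, the same object could be passed as db.edge.find(query, {_id: 1}), mirroring the query on the slide.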

Page 10: Building a Directed Graph with MongoDB

QUERY EFFICIENCY

Max execution time is f(a^hops)

Page 11: Building a Directed Graph with MongoDB

ROUTING, TRAVERSALS, FUNCTIONS

Typically find path from A to B
  Routes have costs
    Low cost or high probability
Our use case is atypical
  LinkedIn vs. Maps
  Not from A to B
    More like “from A with 3 hops”
This matters!
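The "from A with 3 hops" style of query can be sketched as a breadth-first expansion that stops after a fixed number of hops. This is a generic illustration, not Wordnik's routing code: the in-memory Map stands in for edge documents fetched from MongoDB, the function name withinHops and the node ids are hypothetical, and weights are carried but not used for ranking here. With an average fan-out of a edges per node, the work grows roughly like a to the power of the hop count, which may be what the query-efficiency slide refers to.

```javascript
// edges: Map from node id to an array of { to, weight } targets.
// Returns a Map from each reachable node id to the hop count at which
// it was first reached, exploring at most maxHops hops from start.
function withinHops(edges, start, maxHops) {
  const reached = new Map([[start, 0]]);
  let frontier = [start];
  for (let hop = 1; hop <= maxHops && frontier.length > 0; hop++) {
    const next = [];
    // In a MongoDB-backed version, this loop would be one batch fetch
    // of the edge documents for every id in the frontier.
    for (const id of frontier) {
      for (const { to } of edges.get(id) || []) {
        if (!reached.has(to)) {
          reached.set(to, hop);
          next.push(to);
        }
      }
    }
    frontier = next;
  }
  return reached;
}

// Tiny illustrative graph with made-up ids and weights.
const graph = new Map([
  ["a", [{ to: "b", weight: 0.7 }, { to: "c", weight: 0.3 }]],
  ["b", [{ to: "c", weight: 1.0 }]],
  ["c", [{ to: "d", weight: 0.5 }]],
]);

console.log([...withinHops(graph, "a", 2)]);
```

Bounding the hop count keeps the number of database round trips fixed at maxHops when each frontier is fetched in one batch.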

Page 12: Building a Directed Graph with MongoDB

PERFORMANCE + SCALING

Page 13: Building a Directed Graph with MongoDB

PERFORMANCE + SCALING

Query by index only
Use regex syntax in restricted fashion
  Starts with only
  No look behind
  Case sensitive
  Boring? Fast?
Sharding is a no-brainer
  What about ObjectId()?

Page 14: Building a Directed Graph with MongoDB

PERFORMANCE + SCALING

Horizontal? Vertical? Both? And when?
Separate collections by edge type/object type
  Increases storage needs
  Collections all have padding, 30 collections => ~30x padding
Sharding
  Use slick, built-in Mongo sharding
  Roll your own based on your data
What does Wordnik do?
  Neither! (yet)
  30M Nodes, 50M Edges
  One collection for nodes
  One collection for edges

Page 15: Building a Directed Graph with MongoDB

PERFORMANCE + SCALING

Selecting a shard key
  Done in application logic based on OUR data
  Depends on what you need
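One way "roll your own" shard selection in application logic could look is sketched below. The deck states that Wordnik had not sharded yet, so this is purely hypothetical: the hash function, the choice of the node name as the routing key, and the names hashString and shardFor are all assumptions. Routing on the name portion of the _id is one option that would keep a node and its outgoing edges on the same shard.

```javascript
// Simple 32-bit string hash (multiply by 31 and add). Any stable hash
// would do; this one is chosen only to keep the sketch dependency-free.
function hashString(s) {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (Math.imul(h, 31) + s.charCodeAt(i)) >>> 0;
  }
  return h;
}

// Pick a shard from the name portion of a "name|type" node _id or a
// "name|type|edgeType" edge _id (the edge shape is an assumption).
// Names that themselves contain "|" are not handled in this sketch.
function shardFor(id, shardCount) {
  const name = id.split("|")[0];
  return hashString(name) % shardCount;
}

const SHARDS = 4; // hypothetical shard count
console.log(shardFor("cat|word", SHARDS), shardFor("cat|word|synonym", SHARDS));
```

A key chosen this way spreads words evenly but gives no locality between related words, so a traversal of several hops may touch several shards; whether that trade-off is acceptable depends on the data, as the slide notes.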

Page 16: Building a Directed Graph with MongoDB

END RESULT

Solves Wordnik Graph infrastructure needs
  Store Word nodes with UGC, corpus, structured, analytical data
  Batch fetch Edges @ > 50k/second
  Find Edge + endpoints in 80mS
Powers our…
  Word Selection
    Canonicalization
    Misspelling
    “Did you mean” logic
  Classification + Matching Engine

Page 17: Building a Directed Graph with MongoDB

EXAMPLES

Misspellings

Abbreviations

Lemmatization

Page 18: Building a Directed Graph with MongoDB

EXAMPLES

Term normalization
  Find similar words
Meaning normalization
  Find “more common” form

Page 19: Building a Directed Graph with MongoDB

EXAMPLES

Applied Word Graph
Recall:
  “Computers are stupid”
  English is complex
Clustering + classification algorithms:
  Stink without consistent data
    “The” => “the” (duh)
    “geese” => “goose” (ok)
  Stink when they’re slow
Graph + Clustering/Classification
  Just add data

Page 20: Building a Directed Graph with MongoDB

MONGODB MAKES A GREAT GRAPH BACK-END

See more about Wordnik APIs:
  http://developer.wordnik.com
Further Reading
  Migrating from MySQL to MongoDB
    http://www.slideshare.net/fehguy/migrating-from-mysql-to-mongodb-at-wordnik
  Maintaining your MongoDB Installation
    http://www.slideshare.net/fehguy/mongo-sv-tony-tam
Source Code
  Mapping Benchmark
    https://github.com/fehguy/mongodb-benchmark-tools
  Wordnik OSS Tools
    https://github.com/wordnik/wordnik-oss

Page 21: Building a Directed Graph with MongoDB

MONGODB MAKES A GREAT GRAPH BACK-END

Questions?