MongoDB Days UK: Using MongoDB to Build a Fast and Scalable Content Repository, sponsored by Nuxeo
TRANSCRIPT
Using MongoDB to Build a Fast and Scalable Content Repository
Some Context
What We Do and What Problems We Try to Solve
Nuxeo Platform
• We provide a Platform that developers can use to build highly customised Content Applications
• We provide components, and the tools to assemble them
• The Platform is open source (https://github.com/nuxeo)
• Various customers - various use cases
• Me: Product Director at Nuxeo @aescaffre
Document Repository
• Document Oriented Database
➡ store JSON documents
• Document Repository
➡ Manage document attributes, hierarchy, blobs, security, lifecycle, versions
Document Repository
Storage abstraction: be able to choose the right storage
• Depending on the constraints
• Depending on the environment
Document Repository
A Nuxeo Platform document
Document Repository
With custom schemas:
Document Repository
Security on each record:
Document Repository
Thumbnail, Preview URLs
Get A Conversion (image, video, office, sound)
GET http://localhost:8080/nuxeo/api/v1/path/{docPath}/@convert?type=application%2Fpdf
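The `type` parameter in the URL above is the target MIME type, percent-encoded (`application/pdf` becomes `application%2Fpdf`). A minimal sketch of how such a URL could be assembled follows; the helper name `conversion_url` and the sample document path are illustrative assumptions, not part of the Nuxeo API.

```python
from urllib.parse import quote


def conversion_url(base_url: str, doc_path: str, mime_type: str) -> str:
    """Build an @convert URL shaped like the one on the slide (hypothetical helper)."""
    # Keep "/" separators in the document path, but escape everything else.
    encoded_path = quote(doc_path.strip("/"), safe="/")
    # In the query string the "/" of the MIME type must be escaped as %2F.
    encoded_type = quote(mime_type, safe="")
    return f"{base_url}/api/v1/path/{encoded_path}/@convert?type={encoded_type}"


# Sample path is made up for illustration.
url = conversion_url(
    "http://localhost:8080/nuxeo",
    "/default-domain/workspaces/my report",
    "application/pdf",
)
print(url)
```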
Document Repository
• Start a Workflow
curl -X POST 'http://localhost:8080/nuxeo/site/automation/Context.StartWorkflow' -H 'Accept: */*' -H 'Authorization: Basic QWRtaW5pc3RyYXRvcjpBZG1pbmlzdHJhdG9y' -H 'content-type: application/json+nxrequest' -d '{"params":{"id":"serial-review","start":"true"},"input":"/default-domain/Passports/3719050812174596321","context":{}}'
• Do Some Queries
SELECT * FROM Document WHERE files/*1/file/name LIKE '%.txt' AND files/*1/file/length = 0
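The workflow call above is a plain HTTP POST with a JSON body and Basic authentication. The sketch below rebuilds the same request with the Python standard library, without sending it, so it runs without a server. The function name `start_workflow_request` is an assumption for illustration; the URL, headers, and body mirror the curl command.

```python
import base64
import json
import urllib.request


def start_workflow_request(base_url, user, password, workflow_id, doc_path):
    """Build (but do not send) the Context.StartWorkflow request from the slide."""
    # Basic auth is base64("user:password").
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    body = {
        "params": {"id": workflow_id, "start": "true"},
        "input": doc_path,
        "context": {},
    }
    return urllib.request.Request(
        url=f"{base_url}/site/automation/Context.StartWorkflow",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "*/*",
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json+nxrequest",
        },
        method="POST",
    )


request = start_workflow_request(
    "http://localhost:8080/nuxeo",
    "Administrator",
    "Administrator",
    "serial-review",
    "/default-domain/Passports/3719050812174596321",
)
print(request.get_method(), request.full_url)
# To actually send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

The token in the curl command decodes to the default `Administrator:Administrator` credentials, which is why they are used here.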
History : Nuxeo Repository And Storage
• 2006: Nuxeo Repository is based on ZODB (Python / Zope based):
• This is not JSON in NoSQL, but Python serialization in ObjectDB
• Concurrency and performance issues, bad transaction handling
• 2007: Nuxeo Platform 5.1 - Apache JackRabbit (JCR based)
• Mix SQL + Java Serialization + Lucene
• Transaction and consistency issues
• 2009: Nuxeo 5.2 - Nuxeo VCS
• SQL based repository : MVCC & ACID
• Very reliable, but some use cases cannot fit in a SQL DB!
• 2014: Nuxeo 5.9 - Nuxeo DBS
• Document Based Storage repository
• MongoDB is the reference backend
From SQL to NoSQL
Understanding the motivations for moving to MongoDB
SQL Based Repository
KEY LIMITATIONS OF THE SQL APPROACH
• Impedance Issues
• storing Documents in tables is not easy
• requires Caching and Lazy loading
• Scalability
• Document repository can become very large (versions, workflows …)
• Scaling out SQL DB is very complex (and never transparent)
• Concurrency model
• Heavy write is an issue (Quotas, Inheritance)
Need a Different Storage Model!
From SQL to NoSQL
NoSQL with MongoDB
• No Impedance Issue
‣ One Nuxeo Document = One MongoDB Document
• No Scalability Issue
‣ Native distributed architecture allows scale out
• No Concurrency Issue
‣ Document Level "Transactions"
• No Application Level Cache is Needed
‣ No need to manage invalidations
MongoDB Integration
Inside the nuxeo-dbs storage adapter
Document Based Storage & MongoDB
Storing Nuxeo Documents in MongoDB
Hierarchy
• Parent-child relationship: ecm:parentId
• Recursion optimised through ecm:ancestorIds array
• Maintained by the framework (create, delete, move, copy)
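A minimal sketch of why the `ecm:ancestorIds` array removes the need for recursion: finding all descendants becomes a single membership test on the array, at the cost of rewriting the array when a subtree moves. A plain dict stands in for the MongoDB collection, and the `ecm:id` key and the `create`, `move`, and `descendants` helpers are assumptions for illustration, not the nuxeo-dbs implementation.

```python
# In-memory stand-in for a MongoDB collection, keyed by document id.
docs = {}


def create(doc_id, parent_id=None):
    """Store a document with its parent and the full chain of ancestors."""
    parent = docs.get(parent_id)
    ancestors = parent["ecm:ancestorIds"] + [parent_id] if parent else []
    docs[doc_id] = {
        "ecm:id": doc_id,
        "ecm:parentId": parent_id,
        "ecm:ancestorIds": ancestors,
    }


def descendants(doc_id):
    """Single pass, no recursion: similar in spirit to querying {"ecm:ancestorIds": doc_id}."""
    return sorted(d["ecm:id"] for d in docs.values() if doc_id in d["ecm:ancestorIds"])


def move(doc_id, new_parent_id):
    """Re-parent a document and rewrite the ancestor chain of its whole subtree."""
    new_ancestors = docs[new_parent_id]["ecm:ancestorIds"] + [new_parent_id]
    for d in docs.values():
        chain = d["ecm:ancestorIds"]
        if doc_id in chain:
            # Keep the part of the chain from the moved document downwards.
            d["ecm:ancestorIds"] = new_ancestors + chain[chain.index(doc_id):]
    docs[doc_id]["ecm:parentId"] = new_parent_id
    docs[doc_id]["ecm:ancestorIds"] = new_ancestors


create("root")
create("a", "root")
create("b", "root")
create("a1", "a")
create("a1x", "a1")
print(descendants("a"))
move("a1", "b")
print(descendants("b"))
```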
Security
• Generic ACP stored in ecm:acp field
• Precomputed Read ACLs to avoid post-filtering on search
• Simple Set of identities having access
• Semantic restriction on blocking
• Maintained by framework
• Search matches if intersection
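The read ACL idea can be sketched as follows: the ordered ACP is flattened once into a simple set of identities, and a search matches a document when that set intersects the user's identities, so no post-filtering is needed. The field name `ecm:racl`, the `(identity, granted)` tuple shape, and the rule that only a deny on "Everyone" is supported are assumptions made for this illustration (the last one being one reading of the "semantic restriction on blocking" bullet).

```python
def compute_read_acl(acp):
    """Flatten ordered (identity, granted) entries into the set of readers."""
    readers = set()
    for identity, granted in acp:
        if granted:
            readers.add(identity)
        elif identity == "Everyone":
            # Blocking: entries after a deny on Everyone are ignored.
            break
        else:
            # Assumed restriction: per-identity denies cannot be precomputed this way.
            raise ValueError(f"unsupported negative entry for {identity!r}")
    return readers


def search(docs, principals):
    """Return ids of documents whose read ACL intersects the user's identities."""
    wanted = set(principals)
    return [doc["ecm:id"] for doc in docs if wanted & set(doc["ecm:racl"])]


docs = [
    {"ecm:id": "public", "ecm:racl": sorted(compute_read_acl([("Everyone", True)]))},
    {
        "ecm:id": "hr-only",
        "ecm:racl": sorted(
            compute_read_acl([("hr", True), ("Everyone", False), ("members", True)])
        ),
    },
]

print(search(docs, ["alice", "members", "Everyone"]))
print(search(docs, ["bob", "hr", "Everyone"]))
```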
Search
Consistency Challenges
• Unitary Document Operations Are Safe
• No impedance issue
• Large batch updates are not so much of an issue
• SQL DB do not like long running transactions anyway
• Multi-documents transactions are an issue
• Workflows are a typical use case
• Isolation issue
• Other transactions can see intermediate states
Mitigating Consistency Issues
• Transient State Manager
• Run All Operations In Memory
• Flush to MongoDB as late as possible
• Populate an Undo log
• Replay backward in case of Rollback
➡ recover partial transaction management
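A minimal sketch of the transient-state idea: writes stay in memory, a flush pushes them to the store while recording the previous values in an undo log, and a rollback replays that log backwards. The class and method names are hypothetical and a dict stands in for MongoDB; this is not the nuxeo-dbs code.

```python
import copy


class TransientState:
    """Buffers writes in memory and keeps an undo log for whatever was flushed."""

    def __init__(self, store):
        self.store = store   # dict standing in for a MongoDB collection
        self.pending = {}    # changes not yet written to the store
        self.undo_log = []   # (doc_id, previous document or None), in flush order

    def write(self, doc_id, doc):
        self.pending[doc_id] = doc

    def flush(self):
        """Write pending changes; from here on they are visible to other readers."""
        for doc_id, doc in self.pending.items():
            self.undo_log.append((doc_id, copy.deepcopy(self.store.get(doc_id))))
            self.store[doc_id] = doc
        self.pending.clear()

    def commit(self):
        self.flush()
        self.undo_log.clear()

    def rollback(self):
        """Drop unflushed changes and replay the undo log backwards."""
        self.pending.clear()
        for doc_id, previous in reversed(self.undo_log):
            if previous is None:
                self.store.pop(doc_id, None)  # the document did not exist before
            else:
                self.store[doc_id] = previous
        self.undo_log.clear()


store = {"a": {"title": "v1"}}
tx = TransientState(store)
tx.write("a", {"title": "v2"})
tx.write("b", {"title": "new"})
unchanged_before_flush = store == {"a": {"title": "v1"}}
tx.flush()
visible_after_flush = store["a"]["title"] == "v2" and "b" in store
tx.rollback()
print(store)
```

Between `flush()` and `rollback()` other readers of the store can see the intermediate values, which is the isolation limit the slides go on to describe.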
Complete isolation not possible
• Need to flush transient state for queries
• “uncommitted” changes are visible to others
➡Read Uncommitted, at best
Typical Use Cases for MongoDB
Huge Repository - Heavy Loading
• Massive Amount of Documents (X00,000,000+ docs)
➡ Retail DAM repository, Banks archiving repository (email), large B2C companies invoicing output
• Automatic and granular versioning: create a version for each single change
➡ Pharmaceutical, financial, etc.
SQL DB collapses (on commodity hardware)
MongoDB handles the volume
Benchmarking Mass Import
• Process 20000 documents
๏ 700 documents/s with SQL backend (cold cache)
๏ 6,000 documents/s with MongoDB / mmapv1: x9
๏ 11,000 documents/s with MongoDB / wiredTiger: x15
• Process 100000 documents
๏ 750 documents/s with SQL backend (cold cache)
๏ 9,500 documents/s with MongoDB / mmapv1: about x13
๏ 11,500 documents/s with MongoDB / wiredTiger: x15
• Process 200000 documents
๏ 750 documents/s with SQL backend (cold cache)
๏ 14,000 documents/s with MongoDB / mmapv1: about x19
๏ 11,000 documents/s with MongoDB / wiredTiger: x15
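The speed-up factors can be recomputed from the throughput figures above by dividing each MongoDB rate by the SQL rate for the same batch size. The sketch below does only that arithmetic; the dictionary layout and function name are illustrative.

```python
# Throughput in documents/s, copied from the benchmark figures above.
results = {
    20_000: {"sql": 700, "mmapv1": 6_000, "wiredTiger": 11_000},
    100_000: {"sql": 750, "mmapv1": 9_500, "wiredTiger": 11_500},
    200_000: {"sql": 750, "mmapv1": 14_000, "wiredTiger": 11_000},
}


def speedups(row):
    """Ratio of each MongoDB engine's throughput to the SQL baseline."""
    return {
        engine: round(rate / row["sql"], 1)
        for engine, rate in row.items()
        if engine != "sql"
    }


for batch_size, row in results.items():
    print(batch_size, speedups(row))
```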
Benchmarking Scale Out
• 1 Nuxeo node + 1 MongoDB node
• 1900 docs/s
• MongoDB CPU is the bottleneck (800%)
• 2 Nuxeo nodes + 1 MongoDB node
• 1850 docs/s
• MongoDB CPU is the bottleneck (800%)
• 2 Nuxeo nodes + 2 MongoDB nodes
• 3400 docs/s when using read preferences
Adding one MongoDB node adds 80% throughput
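The 80% figure follows from the first and last configurations, as this small check shows. It also makes visible that adding a Nuxeo node alone did not help while MongoDB CPU was the bottleneck. Variable names are illustrative.

```python
# Throughput in docs/s, copied from the scale-out figures above.
one_nuxeo_one_mongo = 1900
two_nuxeo_one_mongo = 1850
two_nuxeo_two_mongo = 3400

# Relative gain over the single-node baseline, in percent.
gain_extra_nuxeo = (two_nuxeo_one_mongo - one_nuxeo_one_mongo) / one_nuxeo_one_mongo * 100
gain_extra_mongo = (two_nuxeo_two_mongo - one_nuxeo_one_mongo) / one_nuxeo_one_mongo * 100

print(f"second Nuxeo node only: {gain_extra_nuxeo:+.1f}%")
print(f"second MongoDB node:    {gain_extra_mongo:+.1f}%")
```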
Geo-distributed Architecture
A Real Life Example
Context
• Who: US Network Carrier
• Goal: Provide VOD Services
• Requirements:
• store videos
• manage metadata
• manage workflows
• generate thumbs
• generate conversions
• manage availability
Nuxeo Platform as a video repository
Challenges
• Very Large Objects:
• Lots of metadata (dublincore, ADI, ratings)
• Massive Daily Updates
• Updates On Rights and Availability
• Need To Track All Changes
• Prove what was the availability for a given date
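One way to picture the "prove availability for a given date" requirement: if every change creates a version, the state at any date is the latest version created on or before it. The sketch below uses made-up version data and a hypothetical `state_at` helper; it does not reflect how the customer's repository is actually modelled.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical version history of one video, sorted by creation date.
versions = [
    (date(2015, 1, 10), {"available": False}),
    (date(2015, 3, 1), {"available": True}),
    (date(2015, 9, 15), {"available": False}),
]


def state_at(history, when):
    """Return the version in effect on `when`, or None if none existed yet."""
    created_dates = [created for created, _ in history]
    # Index of the last version created on or before `when`.
    index = bisect_right(created_dates, when) - 1
    return history[index][1] if index >= 0 else None


print(state_at(versions, date(2015, 6, 1)))
```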
Lots of data + lots of updates ➡ db.createCollection("myMovies")
MongoDB Choice
• They chose MongoDB
• because they have a good use case for MongoDB
• because they wanted to use MongoDB
• change work habits (Open source, NoSQL)
• doing a project with MongoDB is cool!!