![Page 1: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/1.jpg)
Solutions Architect, 10gen
Antoine Girbal
#antoinegirbal
Understanding Storage for performance and data safety
![Page 2: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/2.jpg)
Why pop the hood?• Understanding data safety
• Estimating RAM / disk requirements
• Optimizing performance
![Page 3: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/3.jpg)
Storage Layout
![Page 4: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/4.jpg)
drwxr-xr-x 4 antoine wheel 136 Nov 19 10:12 journal-rw------- 1 antoine wheel 16777216 Oct 25 14:58 test.0-rw------- 1 antoine wheel 134217728 Mar 13 2012 test.1-rw------- 1 antoine wheel 268435456 Mar 13 2012 test.2-rw------- 1 antoine wheel 536870912 May 11 2012 test.3-rw------- 1 antoine wheel 1073741824 May 11 2012 test.4-rw------- 1 antoine wheel 2146435072 Nov 19 10:14 test.5-rw------- 1 antoine wheel 16777216 Nov 19 10:13 test.ns
Directory Layout
![Page 5: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/5.jpg)
Directory Layout
• Each database has one or more data files, all in same folder (e.g. test.0, test.1, …)
• Aggressive preallocation (always 1 spare file)
• Those files get larger and larger, up to 2GB
• There is one namespace file per db which can hold 24000 entries per default. A namespace is a collection or an index.
• The journal folder contains the journal files
![Page 6: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/6.jpg)
Tuning with options
• Use --directoryperdb to separate dbs into own folders which allows to use different volumes (isolation, performance)
• Use --nopreallocate to prevent preallocation
• Use --smallfiles to keep data files smaller
• If using many databases, use –nopreallocate and --smallfiles to reduce storage size
• If using thousands of collections & indexes, increase namespace capacity with --nssize
![Page 7: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/7.jpg)
Internal Structure
![Page 8: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/8.jpg)
Internal File Format
• Files on disk are broken into extents which contain the documents
• A collection has 1 to many extents
• Extent grow exponentially up to 2GB
• Namespace entries in the ns file point to the first extent for that collection
![Page 9: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/9.jpg)
test.0 test.1 test.2
Internal File Formattest.ns Namespaces
Extents
Data Files
![Page 10: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/10.jpg)
Extent Structure
Extentlength
xNext
xPrev
firstRecord
lastRecord
Extentlength
xNext
xPrev
firstRecord
lastRecord
![Page 11: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/11.jpg)
Extents and Records
Extentlength
xNext
xPrev
firstRecord
lastRecord
Data Recordlength
rNext
rPrev
Document
{ _id: “foo”, ... }
Data Recordlength
rNext
rPrev
Document
{ _id: “bar”, ... }
![Page 12: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/12.jpg)
What about indices?
![Page 13: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/13.jpg)
Indexes
• Indexes are BTree structures serialized to disk
• They are stored in the same files as data but using own extents
![Page 14: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/14.jpg)
Index Extents
Extentlength
xNext
xPrev
firstRecord
lastRecord
Index Record
Bucketparent
numKeys
K
length
rNext
rPrev K KK
Index Record
Bucketparent
numKeys
K
length
rNext
rPrev K KK
{ Document }
4 9
1 3 5 6 8 A B
![Page 15: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/15.jpg)
> db.stats(){
"db" : "test","collections" : 22,"objects" : 17000383, ## number of documents"avgObjSize" : 44.33690276272011,"dataSize" : 753744328, ## size of data"storageSize" : 1159569408, ## size of all
containing extents"numExtents" : 81,"indexes" : 85,"indexSize" : 624204896, ## separate index
storage size"fileSize" : 4176478208, ## size of data files on
disk"nsSizeMB" : 16,"ok" : 1
}
the db stats
![Page 16: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/16.jpg)
> db.large.stats(){
"ns" : "test.large","count" : 5000000, ## number of documents"size" : 280000024, ## size of data"avgObjSize" : 56.0000048,"storageSize" : 409206784, ## size of all
containing extents"numExtents" : 18,"nindexes" : 1,"lastExtentSize" : 74846208,"paddingFactor" : 1, ## amount of padding"systemFlags" : 0,"userFlags" : 0,"totalIndexSize" : 162228192, ## separate index
storage size"indexSizes" : {
"_id_" : 162228192},"ok" : 1
}
the collection stats
![Page 17: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/17.jpg)
What’s memory mapping?
![Page 18: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/18.jpg)
Memory Mapped Files
• All data files are memory mapped to RAM by the OS
• Mongo just reads / writes to RAM in the filesystem cache
• OS takes care of the rest!
• Mongo calls fsync every 60 seconds to flush changes to disk
• Virtual process size = total files size + overhead (connections, heap)
• If journal is on, the virtual size will be roughly doubled
![Page 19: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/19.jpg)
Virtual Address Space
32-bit System
232 = 4GB
- 1GB kernel
- .5GB binaries, stack, etc.
= 2.5GB for data
BAD
64-bit System
264 = 1.7 x 1010 GB (16EB)?
0xF0 – 0xFF Kernel
0x00 – 0x7F User
247 = 128TB for data
GOOD
![Page 20: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/20.jpg)
Virtual Address Space
Kernel
STACK…
LIBS
…
test.ns
test.0
test.1
…
…HEAP
MONGOD
NULL
0x7fffffffffff
0x0
{ … }
Disk
DocumentProcess Virtual
Memory
![Page 21: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/21.jpg)
Memory map, love it or hate it• Pros:
– No complex memory / disk code in MongoDB, huge win!
– The OS is very good at caching for any type of storage
– Pure Least Recently Used behavior– Cache stays warm between Mongo restart
• Cons:– RAM is affected by disk fragmentation– RAM is affected by high read-ahead– LRU behavior does not prioritize things (like
indices)
![Page 22: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/22.jpg)
How much data is in RAM?• Resident memory the best indicator of
how much data in RAM
• Resident is: process overhead (connections, heap) + FS pages in RAM that were accessed
• Means that it resets to 0 upon restart even though data is still in RAM due to FS cache
• Use free command to check on FS cache size
• Can be affected by fragmentation and read-ahead
![Page 23: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/23.jpg)
Journaling
![Page 24: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/24.jpg)
The problem• A single insert/update involves writing to many
places (the record, indexes, ns details..)
• What if the electricity goes out? Corruption…
![Page 25: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/25.jpg)
Solution – use a journal
• Data gets written to a journal before making it to the data files
• Operations written to a journal buffer in RAM that gets flushed every 100ms or 100MB
• Once journal written to disk, data safe unless hardware entirely fails
• Journal prevents corruption and allows durability
• Can be turned off, but don’t!
![Page 26: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/26.jpg)
• Section contains single group commit
• Applied all-or-nothing
Journal FormatJHeader
JSectHeader [LSN 3]
DurOp
DurOp
DurOp
JSectFooter
JSectHeader [LSN 7]
DurOp
DurOp
DurOp
JSectFooter
…
Op_DbContext
lengthoffsetfileNodata[length]
lengthoffsetfileNodata[length]
lengthoffsetfileNodata[length]
Write Operation
Set database context for subsequent operations
![Page 27: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/27.jpg)
Can I lose data on hard crash?• Maximum data loss is 100ms (journal flush). This
can be reduced with –journalCommitInterval
• For durability (data is on disk when ack’ed) use the JOURNAL_SAFE write concern (“j” option).
• Note that replication can reduce the data loss further. Use the REPLICAS_SAFE write concern (“w” option).
• As write guarantees increase, latency increases. To maintain performance, use more connections!
![Page 28: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/28.jpg)
What is cost of journal?
• On read-heavy systems, no impact
• Write performance is reduced by 5-30%
• If using separate drive for journal, as low as 3%
• For apps that are write-heavy (1000+ writes per server) there can be slowdown due to mix of journal and data flushes. Use a separate drive!
![Page 29: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/29.jpg)
Fragmentation
![Page 30: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/30.jpg)
Fragmentation
• Files can get fragmented over time if remove() and update() are issued.
• It gets worse if documents have varied sizes
• Fragmentation wastes disk space and RAM
• Also makes writes scattered and slower
• Fragmentation can be checked by comparing size to storageSize in the collection’s stats.
![Page 31: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/31.jpg)
How it looks like
EXTENT
Doc Doc Doc X Doc X
Doc X Doc Doc X
…
BOTH ON DISK AND IN RAM!
![Page 32: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/32.jpg)
How to combat fragmentation?• compact command (maintenance op)
• Normalize schema more (documents don’t grow)
• Pre-pad documents (documents don’t grow)
• Use separate collections over time, then use collection.drop() instead of collection.remove(query)
• --usePowerOf2sizes option makes disk buckets more reusable
![Page 33: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/33.jpg)
Conclusion
• Understand disk layout and footprint
• See how much data is actually in RAM
• Memory mapping is cool
• Answer how much data is ok to lose
• Check on fragmentation and avoid it
![Page 34: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.net/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/34.jpg)
Solutions Architect, 10gen
Antoine Girbal
#antoinegirbal
Questions?