Transcript
Page 1: Webinar: Understanding Storage for Performance and Data Safety

Solutions Architect, 10gen

Antoine Girbal

#antoinegirbal

Understanding Storage for performance and data safety

Page 2: Webinar: Understanding Storage for Performance and Data Safety

Why pop the hood?• Understanding data safety

• Estimating RAM / disk requirements

• Optimizing performance

Page 3: Webinar: Understanding Storage for Performance and Data Safety

Storage Layout

Page 4: Webinar: Understanding Storage for Performance and Data Safety

drwxr-xr-x 4 antoine wheel 136 Nov 19 10:12 journal-rw------- 1 antoine wheel 16777216 Oct 25 14:58 test.0-rw------- 1 antoine wheel 134217728 Mar 13 2012 test.1-rw------- 1 antoine wheel 268435456 Mar 13 2012 test.2-rw------- 1 antoine wheel 536870912 May 11 2012 test.3-rw------- 1 antoine wheel 1073741824 May 11 2012 test.4-rw------- 1 antoine wheel 2146435072 Nov 19 10:14 test.5-rw------- 1 antoine wheel 16777216 Nov 19 10:13 test.ns

Directory Layout

Page 5: Webinar: Understanding Storage for Performance and Data Safety

Directory Layout

• Each database has one or more data files, all in same folder (e.g. test.0, test.1, …)

• Aggressive preallocation (always 1 spare file)

• Those files get larger and larger, up to 2GB

• There is one namespace file per db which can hold 24000 entries per default. A namespace is a collection or an index.

• The journal folder contains the journal files

Page 6: Webinar: Understanding Storage for Performance and Data Safety

Tuning with options

• Use --directoryperdb to separate dbs into own folders which allows to use different volumes (isolation, performance)

• Use --nopreallocate to prevent preallocation

• Use --smallfiles to keep data files smaller

• If using many databases, use –nopreallocate and --smallfiles to reduce storage size

• If using thousands of collections & indexes, increase namespace capacity with --nssize

Page 7: Webinar: Understanding Storage for Performance and Data Safety

Internal Structure

Page 8: Webinar: Understanding Storage for Performance and Data Safety

Internal File Format

• Files on disk are broken into extents which contain the documents

• A collection has 1 to many extents

• Extent grow exponentially up to 2GB

• Namespace entries in the ns file point to the first extent for that collection

Page 9: Webinar: Understanding Storage for Performance and Data Safety

test.0 test.1 test.2

Internal File Formattest.ns Namespaces

Extents

Data Files

Page 10: Webinar: Understanding Storage for Performance and Data Safety

Extent Structure

Extentlength

xNext

xPrev

firstRecord

lastRecord

Extentlength

xNext

xPrev

firstRecord

lastRecord

Page 11: Webinar: Understanding Storage for Performance and Data Safety

Extents and Records

Extentlength

xNext

xPrev

firstRecord

lastRecord

Data Recordlength

rNext

rPrev

Document

{ _id: “foo”, ... }

Data Recordlength

rNext

rPrev

Document

{ _id: “bar”, ... }

Page 12: Webinar: Understanding Storage for Performance and Data Safety

What about indices?

Page 13: Webinar: Understanding Storage for Performance and Data Safety

Indexes

• Indexes are BTree structures serialized to disk

• They are stored in the same files as data but using own extents

Page 14: Webinar: Understanding Storage for Performance and Data Safety

Index Extents

Extentlength

xNext

xPrev

firstRecord

lastRecord

Index Record

Bucketparent

numKeys

K

length

rNext

rPrev K KK

Index Record

Bucketparent

numKeys

K

length

rNext

rPrev K KK

{ Document }

4 9

1 3 5 6 8 A B

Page 15: Webinar: Understanding Storage for Performance and Data Safety

> db.stats(){

"db" : "test","collections" : 22,"objects" : 17000383, ## number of documents"avgObjSize" : 44.33690276272011,"dataSize" : 753744328, ## size of data"storageSize" : 1159569408, ## size of all

containing extents"numExtents" : 81,"indexes" : 85,"indexSize" : 624204896, ## separate index

storage size"fileSize" : 4176478208, ## size of data files on

disk"nsSizeMB" : 16,"ok" : 1

}

the db stats

Page 16: Webinar: Understanding Storage for Performance and Data Safety

> db.large.stats(){

"ns" : "test.large","count" : 5000000, ## number of documents"size" : 280000024, ## size of data"avgObjSize" : 56.0000048,"storageSize" : 409206784, ## size of all

containing extents"numExtents" : 18,"nindexes" : 1,"lastExtentSize" : 74846208,"paddingFactor" : 1, ## amount of padding"systemFlags" : 0,"userFlags" : 0,"totalIndexSize" : 162228192, ## separate index

storage size"indexSizes" : {

"_id_" : 162228192},"ok" : 1

}

the collection stats

Page 17: Webinar: Understanding Storage for Performance and Data Safety

What’s memory mapping?

Page 18: Webinar: Understanding Storage for Performance and Data Safety

Memory Mapped Files

• All data files are memory mapped to RAM by the OS

• Mongo just reads / writes to RAM in the filesystem cache

• OS takes care of the rest!

• Mongo calls fsync every 60 seconds to flush changes to disk

• Virtual process size = total files size + overhead (connections, heap)

• If journal is on, the virtual size will be roughly doubled

Page 19: Webinar: Understanding Storage for Performance and Data Safety

Virtual Address Space

32-bit System

232 = 4GB

- 1GB kernel

- .5GB binaries, stack, etc.

= 2.5GB for data

BAD

64-bit System

264 = 1.7 x 1010 GB (16EB)?

0xF0 – 0xFF Kernel

0x00 – 0x7F User

247 = 128TB for data

GOOD

Page 20: Webinar: Understanding Storage for Performance and Data Safety

Virtual Address Space

Kernel

STACK…

LIBS

test.ns

test.0

test.1

…HEAP

MONGOD

NULL

0x7fffffffffff

0x0

{ … }

Disk

DocumentProcess Virtual

Memory

Page 21: Webinar: Understanding Storage for Performance and Data Safety

Memory map, love it or hate it• Pros:

– No complex memory / disk code in MongoDB, huge win!

– The OS is very good at caching for any type of storage

– Pure Least Recently Used behavior– Cache stays warm between Mongo restart

• Cons:– RAM is affected by disk fragmentation– RAM is affected by high read-ahead– LRU behavior does not prioritize things (like

indices)

Page 22: Webinar: Understanding Storage for Performance and Data Safety

How much data is in RAM?• Resident memory the best indicator of

how much data in RAM

• Resident is: process overhead (connections, heap) + FS pages in RAM that were accessed

• Means that it resets to 0 upon restart even though data is still in RAM due to FS cache

• Use free command to check on FS cache size

• Can be affected by fragmentation and read-ahead

Page 23: Webinar: Understanding Storage for Performance and Data Safety

Journaling

Page 24: Webinar: Understanding Storage for Performance and Data Safety

The problem• A single insert/update involves writing to many

places (the record, indexes, ns details..)

• What if the electricity goes out? Corruption…

Page 25: Webinar: Understanding Storage for Performance and Data Safety

Solution – use a journal

• Data gets written to a journal before making it to the data files

• Operations written to a journal buffer in RAM that gets flushed every 100ms or 100MB

• Once journal written to disk, data safe unless hardware entirely fails

• Journal prevents corruption and allows durability

• Can be turned off, but don’t!

Page 26: Webinar: Understanding Storage for Performance and Data Safety

• Section contains single group commit

• Applied all-or-nothing

Journal FormatJHeader

JSectHeader [LSN 3]

DurOp

DurOp

DurOp

JSectFooter

JSectHeader [LSN 7]

DurOp

DurOp

DurOp

JSectFooter

Op_DbContext

lengthoffsetfileNodata[length]

lengthoffsetfileNodata[length]

lengthoffsetfileNodata[length]

Write Operation

Set database context for subsequent operations

Page 27: Webinar: Understanding Storage for Performance and Data Safety

Can I lose data on hard crash?• Maximum data loss is 100ms (journal flush). This

can be reduced with –journalCommitInterval

• For durability (data is on disk when ack’ed) use the JOURNAL_SAFE write concern (“j” option).

• Note that replication can reduce the data loss further. Use the REPLICAS_SAFE write concern (“w” option).

• As write guarantees increase, latency increases. To maintain performance, use more connections!

Page 28: Webinar: Understanding Storage for Performance and Data Safety

What is cost of journal?

• On read-heavy systems, no impact

• Write performance is reduced by 5-30%

• If using separate drive for journal, as low as 3%

• For apps that are write-heavy (1000+ writes per server) there can be slowdown due to mix of journal and data flushes. Use a separate drive!

Page 29: Webinar: Understanding Storage for Performance and Data Safety

Fragmentation

Page 30: Webinar: Understanding Storage for Performance and Data Safety

Fragmentation

• Files can get fragmented over time if remove() and update() are issued.

• It gets worse if documents have varied sizes

• Fragmentation wastes disk space and RAM

• Also makes writes scattered and slower

• Fragmentation can be checked by comparing size to storageSize in the collection’s stats.

Page 31: Webinar: Understanding Storage for Performance and Data Safety

How it looks like

EXTENT

Doc Doc Doc X Doc X

Doc X Doc Doc X

BOTH ON DISK AND IN RAM!

Page 32: Webinar: Understanding Storage for Performance and Data Safety

How to combat fragmentation?• compact command (maintenance op)

• Normalize schema more (documents don’t grow)

• Pre-pad documents (documents don’t grow)

• Use separate collections over time, then use collection.drop() instead of collection.remove(query)

• --usePowerOf2sizes option makes disk buckets more reusable

Page 33: Webinar: Understanding Storage for Performance and Data Safety

Conclusion

• Understand disk layout and footprint

• See how much data is actually in RAM

• Memory mapping is cool

• Answer how much data is ok to lose

• Check on fragmentation and avoid it

Page 34: Webinar: Understanding Storage for Performance and Data Safety

Solutions Architect, 10gen

Antoine Girbal

#antoinegirbal

Questions?


Top Related