SYMAS™ The LDAP guys.

Life After BerkeleyDB: OpenLDAP's Memory-Mapped Database

Howard Chu
CTO, Symas Corp. [email protected]
Chief Architect, OpenLDAP [email protected]

OpenLDAP Project

● Open source code project
● Founded 1998
● Three core team members
● A dozen or so contributors
● Feature releases every 18-24 months
● Maintenance releases as needed

A Word About Symas

● Founded 1999
● Founders from Enterprise Software world
● platinum Technology (Locus Computing)
● IBM
● Howard joined OpenLDAP in 1999
● One of the Core Team members
● Appointed Chief Architect January 2007

Topics

● Overview
● Background / History
● Obvious Solutions
● Future Directions

Overview

● OpenLDAP has been delivering reliable, high performance for many years

● The performance comes at the cost of fairly complex tuning requirements

● The implementation is not as clean as it could be; it is not what was originally intended

● Cleaning it up requires not just a new server backend, but also a new low-level database

● The new approach has a huge payoff

Background

● OpenLDAP already provides a number of reliable, high performance transactional backends
● Based on Oracle BerkeleyDB (BDB)
● back-bdb released with OpenLDAP 2.1 in 2002
● back-hdb released with OpenLDAP 2.2 in 2003
● Intensively analyzed for performance
● World's fastest since 2005
● Many heavy users with zero downtime

Background

● These backends have always required careful, complex tuning
● Data comes through three separate layers of caches
● Each cache layer has different size and speed characteristics
● Balancing the three layers against each other can be a difficult juggling act
● Performance without the backend caches is unacceptably slow - over an order of magnitude...

Background

● The backend caching significantly increased the overall complexity of the backend code
● Two levels of locking required, since the BDB database locks are too slow
● Deadlocks occurring routinely in normal operation, requiring additional backoff/retry logic

Background

● The caches were not always beneficial, and were sometimes detrimental
● data could exist in 3 places at once - filesystem, database, and backend cache - thus wasting memory

● searches with result sets that exceeded the configured cache size would reduce the cache effectiveness to zero

● malloc/free churn from adding and removing entries in the cache could trigger pathological heap behavior in libc malloc
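
The point about oversized result sets can be illustrated with a small sketch. It assumes a plain least-recently-used (LRU) eviction policy, which the slides do not state; `LRUCache` and `hit_rate` are hypothetical names used only for this illustration and are not taken from slapd.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)         # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.entries[key] = object()          # stand-in for a decoded entry
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used


def hit_rate(capacity, result_set_size, passes=3):
    """Repeat the same search several times and report the cache hit rate."""
    cache = LRUCache(capacity)
    for _ in range(passes):
        for entry_id in range(result_set_size):
            cache.get(entry_id)
    return cache.hits / (cache.hits + cache.misses)


fits = hit_rate(capacity=100, result_set_size=100)     # result set fits
too_big = hit_rate(capacity=100, result_set_size=101)  # one entry too many
print(fits, too_big)
```

Under these assumptions, a result set just one entry larger than the cache evicts every entry before it is needed again, so repeated searches never hit the cache.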

Background

● Overall the backends require too much attention
● Too much developer time spent finding workarounds for inefficiencies
● Too much administrator time spent tweaking configurations and cleaning up database transaction logs

Obvious Solutions

● Cache management is a hassle, so don't do any caching
● The filesystem already caches data, there's no reason to duplicate the effort

● Lock management is a hassle, so don't do any locking
● Use Multi-Version Concurrency Control (MVCC)
● MVCC makes it possible to perform reads with no locking

Obvious Solutions

● BDB supports MVCC, but it still requires complex caching and locking

● To get the desired results, we need to abandon BDB

● Surveying the landscape revealed no other database libraries with the desired characteristics

● Time to write our own...

MDB Approach

● Based on the "Single-Level Store" concept
● Not new, first implemented in Multics in 1964
● Access a database by mapping the entire database into memory
● Data fetches are satisfied by direct reference to the memory map, there is no intermediate page or buffer cache

Single-Level Store

● The approach is only viable if process address spaces are larger than the expected data volumes
● For 32 bit processors, the practical limit on data size is under 2GB
● For common 64 bit processors which only implement 48 bit address spaces, the limit is 47 bits or 128 terabytes
● The upper bound at 63 bits is 8 exabytes
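
The limits quoted above follow directly from powers of two. The short check below is only arithmetic; `addressable_bytes` is a hypothetical helper name, and the sizes are computed in binary units (TiB, EiB), which is how the "128 terabytes" and "8 exabytes" figures come out exactly.

```python
GIB = 2**30
TIB = 2**40
EIB = 2**60


def addressable_bytes(address_bits):
    """Number of bytes reachable with the given number of address bits."""
    return 2**address_bits


limit_31 = addressable_bytes(31) // GIB   # 31 usable bits on a 32 bit CPU
limit_47 = addressable_bytes(47) // TIB   # 47 usable bits on a 48 bit CPU
limit_63 = addressable_bytes(63) // EIB   # upper bound at 63 bits
print(limit_31, "GiB,", limit_47, "TiB,", limit_63, "EiB")
```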

MDB Approach

● Uses a read-only memory map
● Protects the database structure from corruption due to stray writes in memory
● Any attempts to write to the map will cause a SEGV, allowing immediate identification of software bugs
● There's no point in making the pages writable anyway, since only existing pages may be written. Growing the database requires file ops (write, ftruncate) so for uniformity, file ops are also used for updates.
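
A minimal sketch of the same idea using Python's standard `mmap` module, not MDB itself. It assumes a Unix-like system with a unified page cache (so that file writes are visible through a shared mapping) and uses `os.pwrite`, which is not available on Windows. In Python a write through a read-only map raises `TypeError` rather than a SEGV; `demo_readonly_map` is a hypothetical name.

```python
import mmap
import os
import tempfile


def demo_readonly_map():
    # Create a small "database" file with ordinary file operations.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello, single-level store")
        # Map the whole file read-only; reads are plain memory references.
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as view:
            first = bytes(view[:5])
            try:
                view[0:5] = b"HELLO"      # stray write through the map
                write_rejected = False
            except TypeError:
                write_rejected = True     # Python refuses; C code would fault
            # Updates go through file ops instead and show up in the map.
            os.pwrite(fd, b"J", 0)
            after = bytes(view[:5])
        return first, write_rejected, after
    finally:
        os.close(fd)
        os.unlink(path)


first, write_rejected, after = demo_readonly_map()
print(first, write_rejected, after)
```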

MDB Approach

● Implement MVCC using copy-on-write
● In-use data is never overwritten, modifications are performed by copying the data and modifying the copy
● Since updates never alter existing data, the database structure can never be corrupted by incomplete modifications
– Write-ahead transaction logs are unnecessary
● Readers always see a consistent snapshot of the database, they are fully isolated from writers
– Read accesses require no locks
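
The copy-on-write idea can be sketched in a few lines. This is a toy, not MDB's design: it copies a whole dictionary where a real implementation copies only the affected btree pages, and `CowStore` is a hypothetical name. The single write lock reflects the one-writer-at-a-time behavior described later in the talk.

```python
import threading
from types import MappingProxyType


class CowStore:
    """Toy copy-on-write store: each commit publishes a new immutable version."""

    def __init__(self):
        self._root = MappingProxyType({})     # current committed version
        self._write_lock = threading.Lock()   # only one writer at a time

    def snapshot(self):
        """Readers take no lock: they just keep a reference to the root."""
        return self._root

    def commit(self, changes):
        with self._write_lock:
            new_version = dict(self._root)    # copy, never modify in place
            new_version.update(changes)
            # Publishing is a single reference assignment.
            self._root = MappingProxyType(new_version)


store = CowStore()
store.commit({1: "foo"})
reader_view = store.snapshot()            # reader starts here
store.commit({1: "blah", 2: "bar"})       # writer commits afterwards
print(dict(reader_view), dict(store.snapshot()))
```

The reader that took its snapshot before the second commit keeps seeing the old version unchanged, while new readers see the new one.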

MVCC Details

● "Full" MVCC can be extremely resource intensive
● Databases typically store complete histories reaching far back into time
● The volume of data grows extremely fast, and grows without bound unless explicit pruning is done
● Pruning the data using garbage collection or compaction requires more CPU and I/O resources than the normal update workload
– Either the server must be heavily over-provisioned, or updates must be stopped while pruning is done
● Pruning requires tracking of in-use status, which typically involves reference counters, which require locking

MDB Approach

● MDB nominally maintains only two versions of the database
● Rolling back to a historical version is not interesting for OpenLDAP
● Older versions can be held open longer by reader transactions

● MDB maintains a free list tracking the IDs of unused pages
● Old pages are reused as soon as possible, so data volumes don't grow without bound

● MDB tracks in-use status without locks
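
One way in-use status could be tracked without locks is a table in which every reader owns a slot and records the version it is reading; the writer scans the table to find the oldest version still needed. The slides do not describe the mechanism, so everything below is an assumption for illustration only: `ReaderTable`, `oldest_needed`, and `reusable_pages` are hypothetical names, and the rule that the previous version is always retained is taken from the "two versions" bullet above.

```python
class ReaderTable:
    """Each reader owns one slot and only ever writes to its own slot."""

    def __init__(self, slots):
        self.slots = [None] * slots          # None means the slot is idle

    def begin_read(self, slot, snapshot_txn):
        self.slots[slot] = snapshot_txn      # a single store, no lock taken

    def end_read(self, slot):
        self.slots[slot] = None

    def oldest_needed(self, latest_txn):
        """Oldest version in use: active readers plus the previous version."""
        active = [txn for txn in self.slots if txn is not None]
        return min(active + [latest_txn - 1])


def reusable_pages(free_list, oldest):
    """Pages freed by txn T were last visible in version T - 1, so they can
    be reused once no version older than T is still needed."""
    return sorted(page
                  for txn, pages in free_list.items() if txn <= oldest
                  for page in pages)


table = ReaderTable(slots=4)
free_list = {2: [2], 3: [3, 4]}              # freeing txn -> page numbers
no_readers = reusable_pages(free_list, table.oldest_needed(latest_txn=3))
table.begin_read(slot=0, snapshot_txn=1)     # long-running reader on version 1
with_old_reader = reusable_pages(free_list, table.oldest_needed(latest_txn=3))
print(no_readers, with_old_reader)
```

In this sketch a long-running reader simply delays page reuse; it never blocks the writer.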

Implementation Highlights

● MDB library started from the append-only btree code written by Martin Hedenfalk for his ldapd, which is bundled in OpenBSD
● Stripped out all the parts we didn't need (page cache management)
● Borrowed a couple pieces from back-bdb for expedience
● Changed from append-only to page-reclaiming
● Restructured to allow adding ideas from BDB that we still wanted

Implementation Highlights

● Resulting library was under 32KB of object code
● Compared to the original btree.c at 39KB
● Compared to BDB at 1.5MB

● API is loosely modeled after the BDB API to ease migration of back-bdb code to use MDB

Btree Operation

Basic Elements

● Database Page: Pgno, Misc...
● Data Page: Pgno, Misc..., offset, followed by key, data
● Meta Page: Pgno, Misc..., Root

Btree Operation: Write-Ahead Logger

● Start: Meta Page (Pgno: 0, Misc..., Root: EMPTY); the Write-Ahead Log is empty
● The log gains "Add 1,foo to page 1"
● Data Page (Pgno: 1, Misc..., offset: 4000, holding 1,foo) appears; the Meta Page Root becomes 1
● The log gains "Commit"
● The log gains "Add 2,bar to page 1"
● Data Page 1 is updated in place: offset: 4000 and offset: 3000, holding 2,bar and 1,foo
● The log gains "Commit"
● The log gains "Checkpoint"; the Data Page and Meta Page are now shown both in RAM and on Disk

Btree Operation: Append-Only

● Start: Meta Page (Pgno: 0, Misc..., Root: EMPTY)
● Data Page (Pgno: 1, offset: 4000, holding 1,foo) is appended
● Meta Page (Pgno: 2, Root: 1) is appended
● Data Page (Pgno: 3, offset: 4000 and offset: 3000, holding 2,bar and 1,foo) is appended
● Meta Page (Pgno: 4, Root: 3) is appended
● Data Page (Pgno: 5, holding 2,bar and 1,blah) is appended
● Meta Page (Pgno: 6, Root: 5) is appended
● Data Page (Pgno: 7, holding 2,xyz and 1,blah) is appended
● Meta Page (Pgno: 8, Root: 7) is appended
● All earlier pages remain in the file

Btree Operation: MDB

● Start: two Meta Pages, Pgno: 0 and Pgno: 1, each with TXN: 0, FRoot: EMPTY, DRoot: EMPTY
● Data Page (Pgno: 2, offset: 4000, holding 1,foo) is written
● Meta Page 1 becomes TXN: 1, FRoot: EMPTY, DRoot: 2
● Data Page (Pgno: 3, offset: 4000 and offset: 3000, holding 2,bar and 1,foo) is written
● Data Page (Pgno: 4, holding "txn 2, page 2") is written
● Meta Page 0 becomes TXN: 2, FRoot: 4, DRoot: 3
● Data Page (Pgno: 5, holding 2,bar and 1,blah) is written
● Data Page (Pgno: 6, holding "txn 3, page 3,4" and "txn 2, page 2") is written
● Meta Page 1 becomes TXN: 3, FRoot: 6, DRoot: 5
● Page 2 is reused: Data Page (Pgno: 2) now holds 2,xyz and 1,blah
● Data Page (Pgno: 7, holding "txn 4, page 5,6" and "txn 3, page 3,4") is written
● Meta Page 0 becomes TXN: 4, FRoot: 7, DRoot: 2
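
The page numbers in the MDB walkthrough can be reproduced with a toy simulation. It makes strong simplifying assumptions that are not from MDB's source: the whole tree is a single data page rewritten by every transaction, the free list fits in one page, there are no long-running readers, and pages freed by transaction T become reusable from transaction T + 2 onward (a rule inferred from the slides' example). `run_mdb_walkthrough` is a hypothetical name.

```python
def run_mdb_walkthrough(num_txns):
    """Return the meta page written by each txn as (txn, FRoot, DRoot)."""
    metas = [(0, None, None), (0, None, None)]   # pages 0 and 1
    free_list = {}                               # freeing txn -> page numbers
    next_pgno = 2                                # first page after the metas
    history = []

    for txn in range(1, num_txns + 1):
        # The latest committed meta page is the one written by txn - 1.
        _, old_froot, old_droot = metas[(txn - 1) % 2]

        # Pages freed two or more txns ago are referenced by neither meta.
        reusable = []
        for freed_by in sorted(free_list):
            if freed_by <= txn - 2:
                reusable.extend(free_list.pop(freed_by))

        def alloc():
            nonlocal next_pgno
            if reusable:
                return reusable.pop(0)           # reuse an old page
            next_pgno += 1
            return next_pgno - 1                 # otherwise grow the file

        droot = alloc()                          # copy-on-write data page
        freed = [p for p in (old_droot, old_froot) if p is not None]
        if freed:
            free_list[txn] = freed               # old copies become free
        froot = alloc() if free_list else None   # new free-list page
        metas[txn % 2] = (txn, froot, droot)     # alternate the meta pages
        history.append(metas[txn % 2])
    return history, next_pgno


history, pages_after_4 = run_mdb_walkthrough(4)
_, pages_after_20 = run_mdb_walkthrough(20)
print(history, pages_after_4, pages_after_20)
```

Under these assumptions the first four transactions give the same TXN, FRoot, and DRoot values as the slides, and the file stops growing at eight pages no matter how many more transactions run.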

Notable Special Features

● Explicit Key Types
● Support for reverse byte order comparisons, as well as native binary integer comparisons
● Minimizes the need for custom key comparison functions, allows DBs to be used safely by applications without special knowledge
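
Why explicit key types matter can be shown with integer keys stored as raw bytes. The sketch below is plain Python, not the MDB API; `reverse_bytes_key` and `native_int_key` are hypothetical helpers standing in for the two comparison modes named above.

```python
import struct


def reverse_bytes_key(key):
    """Compare byte strings from the last byte toward the first."""
    return key[::-1]


def native_int_key(key):
    """Interpret the key as a native-endian unsigned 32 bit integer."""
    return struct.unpack("=I", key)[0]


values = [1, 256, 2, 65536]

# Little-endian integers sort incorrectly under plain byte-wise comparison.
le_keys = [struct.pack("<I", v) for v in values]
plain_order = [struct.unpack("<I", k)[0] for k in sorted(le_keys)]
reverse_order = [struct.unpack("<I", k)[0]
                 for k in sorted(le_keys, key=reverse_bytes_key)]

# Native integers compared as integers sort correctly on any platform.
native_keys = [struct.pack("=I", v) for v in values]
native_order = [native_int_key(k)
                for k in sorted(native_keys, key=native_int_key)]

print(plain_order, reverse_order, native_order)
```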

Notable Special Features

● Append Mode
● Ultra-fast writes when keys are added in sequential order
● Bypasses standard page-split algorithm when pages are filled, avoids unnecessary memcpy's
● Allows databases to be bulk loaded at the full sequential write speed of the underlying storage system
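
The effect of bypassing the page split can be seen in a toy model of leaf pages. This is a sketch under simplified assumptions (a flat list of leaf pages, a made-up capacity of four keys, a split that moves half the keys), not MDB's actual algorithm; `load_with_splits` and `load_in_append_mode` are hypothetical names.

```python
PAGE_CAPACITY = 4   # keys per leaf page (toy value)


def load_with_splits(keys):
    """Standard behavior: a full page is split in half before inserting."""
    pages = [[]]
    for key in keys:                       # keys arrive in ascending order
        last = pages[-1]
        if len(last) == PAGE_CAPACITY:
            half = PAGE_CAPACITY // 2
            pages[-1], new_page = last[:half], last[half:]   # copies keys
            pages.append(new_page)
            last = new_page
        last.append(key)
    return pages


def load_in_append_mode(keys):
    """Append mode: a full page is left alone and a fresh page is started."""
    pages = [[]]
    for key in keys:
        if len(pages[-1]) == PAGE_CAPACITY:
            pages.append([])               # no split, no copying
        pages[-1].append(key)
    return pages


keys = list(range(16))
split_pages = load_with_splits(keys)
append_pages = load_in_append_mode(keys)
print(len(split_pages), len(append_pages))
```

With sequential keys the split version leaves most pages half full, while the append version fills every page completely and copies nothing.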

Notable Special Features

● Reserve Mode
● Allocates space in write buffer for data of user-specified size, returns address
● Useful for data that is generated dynamically instead of statically copied
● Allows generated data to be written directly to DB output buffer, avoiding unnecessary memcpy
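
The reserve idea can be sketched with a `bytearray` and a `memoryview`. `WriteBuffer` and its methods are hypothetical names for this illustration, not the MDB API, and the sketch omits bounds checking.

```python
import struct


class WriteBuffer:
    """Toy output buffer offering an ordinary put and a reserve call."""

    def __init__(self, size):
        self._buf = bytearray(size)
        self._used = 0

    def put(self, data):
        """Ordinary put: the caller builds the value, then it is copied in."""
        start = self._used
        self._buf[start:start + len(data)] = data
        self._used += len(data)
        return start

    def reserve(self, size):
        """Set aside `size` bytes and return a writable view of them."""
        start = self._used
        self._used += size
        return memoryview(self._buf)[start:start + size]

    def contents(self):
        return bytes(self._buf[:self._used])


buf = WriteBuffer(64)
slot = buf.reserve(8)                      # space is claimed first
struct.pack_into("<II", slot, 0, 7, 42)    # value is generated in place
print(buf.contents())
```

The value is serialized straight into the buffer's own storage, so no intermediate copy of the encoded record is ever made.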

Notable Special Features

● Fixed Mapping
● Uses a fixed address for the memory map
● Allows complex pointer-based data structures to be stored directly with minimal serialization
● Objects using persistent addresses can thus be read back with no deserialization
● Useful for object-oriented databases, among other purposes

Microbenchmark Results

● Comparisons based on Google's LevelDB
● Also tested against Kyoto Cabinet's TreeDB, SQLite3, and BerkeleyDB
● Tested using RAM filesystem (tmpfs), reiserfs on SSD, and multiple filesystems on HDD
● btrfs, ext2, ext3, ext4, jfs, ntfs, reiserfs, xfs, zfs
● ext3, ext4, jfs, reiserfs, xfs also tested with external journals

Microbenchmark Results

● Relative Footprint
● Clearly MDB has the smallest footprint
● Carefully written C code beats C++ every time

   text    data     bss     dec     hex filename
 271991    1456     320  273767   42d67 db_bench
1682579    2288     296 1685163  19b6ab db_bench_bdb
  96879    1500     296   98675   18173 db_bench_mdb
 655988    7768    1688  665444   a2764 db_bench_sqlite3
 296244    4808    1080  302132   49c34 db_bench_tree_db

Microbenchmark Results

[Charts: Read Performance, Small Records, Sequential and Random. Series: SQLite3, TreeDB, LevelDB, BDB, MDB]

Read Performance, Large Records (bar labels in legend order; also charted on a log scale)

             SQLite3     TreeDB      LevelDB     BDB         MDB
Sequential   7476.36     18536.37    194628.26   9168.76     33333333.33
Random       7690.18     17207.26    17114.79    9347.37     2012072.44

Asynchronous Write Performance, Small Records, tmpfs

             SQLite3     TreeDB      LevelDB     BDB         MDB
Sequential   55859.68    345423.14   562113.55   90530.51    113160.57
Random       37074.11    106134.58   363504.18   47950.13    93835.04

Batched Write Performance, Small Records, tmpfs

             SQLite3     TreeDB      LevelDB     BDB         MDB
Sequential   110558.32   345423.14   745712.16   182215.74   2493765.59
Random       49803.28    106134.58   469263.26   61248.24    155521

Synchronous Write Performance, Small Records, tmpfs

             SQLite3     TreeDB      LevelDB     BDB         MDB
Sequential   51969.65    6888.81     372023.81   86467.79    157331.66
Random       45850.53    7079.7      349895.03   78296.27    147775.97

Microbenchmark Results: Small Records, HDD

Each chart plots SQLite3, TreeDB, LevelDB, BerkeleyDB, and MDB by Filesystem: btrfs, ext2, ext3, ext3x, ext4, ext4x, jfs, jfsx, ntfs, reiser, reiserx, xfs, xfsx, zfs.

[Chart: Asynchronous Sequential Write Performance (fillseq)]
[Chart: Asynchronous Random Write Performance (fillrand)]
[Chart: Batched Sequential Write Performance (fillseqbatch)]
[Chart: Batched Random Write Performance (fillrandbatch)]
[Chart: Synchronous Sequential Write Performance (fillseqsync)]
[Chart: Synchronous Random Write Performance (fillrandsync)]
[Chart: Synchronous Sequential Write Performance, Log Scale (fillseqsync)]

Microbenchmark Results

● Clearly the filesystem type has a huge impact on database performance

● Each database behaves differently; benchmark sites that talk about "synthetic database workloads" have no clue what they're measuring

● The only way to really know is to measure your actual application in your actual deployment

● Additional test results available on http://highlandsun.com/hyc/mdb/

Benchmarking...

● MDB in a real application - the OpenLDAP slapd server, using the new back-mdb slapd backend

Implementation Highlights

● back-mdb code is based on back-bdb/hdb
● Copied the back-bdb source tree
● Deleted all cache-management code
● Adapted to MDB API
– conversion was feature-complete in only 5 days
● Source comprises 340KB, compared to 476KB for back-bdb/hdb - 30% smaller
● Nothing sacrificed - supports all the same features as back-hdb

Tradeoffs?

● Dropping the entry cache means we incur a cost to decode entries on every query, instead of simply operating on a fully-decoded entry in a cache
● No problem, just make entry decoding in back-mdb cost less than an entry cache access in back-hdb

● The copy-on-write approach allows only a single writer at a time
● BDB allows multiple concurrent writers, theoretically greater throughput
● Under heavy load, BDB txn deadlocks slow it down to slower than MDB; MDB performance remains constant

Results

Time to slapadd -q 5 million entries (Time HH:MM:SS; chart legend: real, user, sys; bar labels in the order listed on the chart)

mdb multi    00:29:47
mdb double   00:24:16
mdb single   00:27:05
hdb multi    00:52:50
hdb double   00:45:59
hdb single   00:50:08

Process and DB sizes (GB)

             hdb     mdb
slapd size   26      6.7
DB size      15.6    9

Initial / Concurrent Search Times

       1st        2nd        2          4          8          16
hdb    04:15.40   00:16.20   00:24.62   00:32.17   01:04.82   03:04.46
mdb    00:12.47   00:09.94   00:10.39   00:10.87   00:10.81   00:11.82

[Chart: SLAMD Search Rate Comparison. Axis: Searches/sec. Series: hdb, mdb, other 1, other 2]
[Chart: SLAMD Modify Rate Results. Axis: Modifies/sec. Series: hdb, mdb]
[Chart: SLAMD Search with Modify. Axes: Searches/sec, Modifies/sec. Series: hdb, mdb]

Conclusions

● The combination of memory-mapped operation with MVCC is extremely potent for read-oriented databases
● Reduced administrative overhead
– no periodic cleanup / maintenance required
– no particular tuning required
● Reduced developer overhead
– code size and complexity drastically reduced
● Enhanced efficiency
– read performance is significantly increased

Conclusions

● The MDB approach is not just a one-off solution
● While initially developed on desktop Linux, it has also been ported to Windows, MacOSX, and Android with no particular difficulty
● While developed specifically for OpenLDAP, porting to other code bases is also under way
– A port to SQLite 3.7.1 is available on gitorious
– Replacements for BerkeleyDB in Cyrus-SASL, Heimdal, OpenDKIM are available
– Replacement for BerkeleyDB in perl DBD planned
– There are probably many other suitable applications for a small-footprint database library with low write rates and near-zero read overhead

Future Work

● A number of items remain on the TODO list
● Investigate optimizations for write performance
● Allow database max size and other settings to be grown dynamically instead of statically configured
● Functions to facilitate incremental and/or full backups
● Storing back-mdb entries in native format, so no decoding is needed at all

● None of these are show-stoppers, MDB and back-mdb already meet or exceed all expectations

Where Next?

● Symas Corp. is offering (for a limited time) 6 months free support to anyone using back-mdb

● Several large Symas partners are testing and deploying MDB in their own applications, as well as back-mdb in their LDAP deployments

● A port of XDAndroid using the MDB port of SQLite3 is in progress

● A port of MemcacheDB using MDB is planned
● Internal testing and benchmarking continues...

SQLite Footnotes

● Time to Insert 1000 random records
● Original SQLite 3.7.7.1: 22.43 seconds
● SQLite with MDB: 1.06 seconds

● Time to read (Select) 1000 records
● Difference unmeasurable
● Instruction-level profile shows >95% of execution time is spent in SQL parsing, only 2-3% in actual DB fetch

SQLite Footnotes

● How "Lite" is SQLite?
● MDB is over an order of magnitude more efficient for writes
● Writes are most power intensive
● The big story here isn't the absolute speed, it's the efficiency - SQLite is used extensively in smartphones, tablets, and other devices
● a 23:1 improvement in write efficiency translates to a significant gain in battery life