bookie storage - apache bookkeeper meetup - 2015-06-28


TRANSCRIPT

Bookie Storage

Matteo Merli

BookKeeper

2

▪ Provides distributed logs (ledgers)
▪ BookKeeper client + Bookies
▪ Client API can be summarized as (see the sketch after this list):
  › createLedger() → ledgerId
  › ledger.addEntry(data) → entryId
  › ledger.readEntry(ledgerId, entryId)
  › deleteLedger(ledgerId)

▪ BK client library implements all the “logic”
  › Consistency, metadata in ZK, fencing, recovery, replication

▪ Bookie servers are in charge of storing the data
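For reference, a rough sketch of that client API with the BookKeeper Java client; the ensemble/write-quorum sizes, digest type and ZooKeeper connect string are made up for the example:

```java
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;

public class ClientApiSketch {
    public static void main(String[] args) throws Exception {
        BookKeeper bk = new BookKeeper("zk1:2181");

        // createLedger() → ledgerId
        LedgerHandle ledger = bk.createLedger(3, 2, DigestType.CRC32,
                "secret".getBytes(StandardCharsets.UTF_8));
        long ledgerId = ledger.getId();

        // ledger.addEntry(data) → entryId
        long entryId = ledger.addEntry("hello".getBytes(StandardCharsets.UTF_8));

        // readEntry(ledgerId, entryId) → payload
        Enumeration<LedgerEntry> entries = ledger.readEntries(entryId, entryId);
        byte[] payload = entries.nextElement().getEntry();

        ledger.close();

        // deleteLedger(ledgerId)
        bk.deleteLedger(ledgerId);
        bk.close();
    }
}
```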


Bookie external interface

3

▪ Simple primitives (see the interface sketch below)
  › addEntry(ledgerId, entryId, payload) → OK
  › readEntry(ledgerId, entryId) → payload
  › getLastEntry(ledgerId) → entryId

▪ Is that all??
  › Fence flag on readEntry() → no more writes allowed to a ledger
  › Deletion → background garbage collection
  › Auto replication → it's a different logical component that uses the BK client API
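A minimal Java sketch of what that external interface amounts to; the names are illustrative, not the actual classes in org.apache.bookkeeper.bookie (the real server exposes these operations over its network protocol):

```java
import java.io.IOException;

// Illustrative only: sketches the bookie-side primitives described above.
public interface BookieStore {
    // addEntry(ledgerId, entryId, payload) → OK
    void addEntry(long ledgerId, long entryId, byte[] payload) throws IOException;

    // readEntry(ledgerId, entryId) → payload; when fence is set, the ledger is
    // marked as fenced and no further writes to it are accepted
    byte[] readEntry(long ledgerId, long entryId, boolean fence) throws IOException;

    // getLastEntry(ledgerId) → entryId
    long getLastEntry(long ledgerId) throws IOException;
}
```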


Interleaved storage

4

▪ Default bookie storage
▪ Uses a journal on a separate device
  › Every entry is synced to the journal

▪ Entries are also written to "entryLog" files as they come in
  › Writes on the entryLog are periodically flushed in background
  › Entries are appended to the current entry log file
  › When entryLog reaches 2GB, a new one is created
  › Entries for multiple ledgers are interleaved in the same entry log

▪ Need to maintain an index (ledgerId, entryId) → (entryLogId, offset)
  › The default implementation uses a file for each ledger to store the data locations
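A minimal sketch of that index, using an in-memory map and packing (entryLogId, offset) into a single long; illustrative only, since the default storage keeps these locations in per-ledger index files and DbLedgerStorage keeps them in RocksDB:

```java
import java.util.concurrent.ConcurrentHashMap;

class EntryLocationIndex {
    // (ledgerId, entryId) used as the lookup key
    record Key(long ledgerId, long entryId) {}

    // location = entryLogId in the high 32 bits, offset within the log in the low 32 bits
    private final ConcurrentHashMap<Key, Long> locations = new ConcurrentHashMap<>();

    void addLocation(long ledgerId, long entryId, long entryLogId, long offset) {
        locations.put(new Key(ledgerId, entryId), (entryLogId << 32) | offset);
    }

    long[] getLocation(long ledgerId, long entryId) {
        Long location = locations.get(new Key(ledgerId, entryId));
        if (location == null) {
            throw new IllegalArgumentException("entry not found");
        }
        return new long[] { location >>> 32, location & 0xFFFFFFFFL }; // {entryLogId, offset}
    }
}
```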

Bookie Garbage Collection

5

▪ Runs periodically in background
▪ Get the list of ledgers stored locally
▪ Get the list of ledgers from ZK
▪ Any ledger not present in ZK is marked for deletion
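The scan-and-compare pass boils down to a set difference; a trivial sketch (names are illustrative, not the actual GC thread API):

```java
import java.util.HashSet;
import java.util.Set;

class ScanAndCompare {
    // Any ledger present on local disk but no longer present in the ZK
    // metadata has been deleted by a client and can be garbage collected.
    static Set<Long> ledgersToDelete(Set<Long> storedLocally, Set<Long> knownInZooKeeper) {
        Set<Long> deletable = new HashSet<>(storedLocally);
        deletable.removeAll(knownInZooKeeper);
        return deletable;
    }
}
```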

When are entry logs deleted?

6

▪ Need to keep track of the usage of each entry log
▪ EntryLog metadata, an in-memory map for each entryLog
  › (ledgerId → size)

▪ Whenever a ledger is deleted, each entry log updates its usage
▪ Metadata is appended to each entryLog, to avoid having to scan the log when the bookie restarts (since 4.4.0)

▪ If the entryLog usage is 0% → delete it
▪ If usage falls below x% → compaction
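A sketch of the per-entry-log bookkeeping behind those decisions, assuming a simple ledgerId → bytes map (the bookie's real class for this is EntryLogMetadata):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class EntryLogUsage {
    private final Map<Long, Long> ledgerSizes = new ConcurrentHashMap<>(); // ledgerId → bytes in this log
    private long totalSize;     // bytes ever written to this entry log
    private long remainingSize; // bytes still belonging to live ledgers

    void addLedger(long ledgerId, long size) {
        ledgerSizes.merge(ledgerId, size, Long::sum);
        totalSize += size;
        remainingSize += size;
    }

    // Called when GC deletes a ledger: drop its contribution to this log
    void removeLedger(long ledgerId) {
        Long size = ledgerSizes.remove(ledgerId);
        if (size != null) {
            remainingSize -= size;
        }
    }

    // 0.0 → the whole entry log is garbage and the file can be deleted;
    // below the minor/major thresholds → candidate for compaction
    double usage() {
        return totalSize == 0 ? 0.0 : (double) remainingSize / totalSize;
    }
}
```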


Entry log compaction

7

▪ There are 2 compactions, which differ in threshold and frequency:
  › Minor (every 1 hour, usage < 50%)
  › Major (every 1 day, usage < 80%)

▪ Scan the entryLog file and append all valid entries into the current (newer) entryLog file

▪ Update the indexes to point to the new locations
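Put together, a compaction pass is roughly the following; this sketch uses illustrative types, not the bookie's actual compactor or entry-log classes:

```java
import java.io.IOException;
import java.util.Set;

record Entry(long ledgerId, long entryId, byte[] payload) {}

interface EntryLogReader { Iterable<Entry> scan(long logId) throws IOException; }
interface EntryLogWriter { long id(); long append(Entry e) throws IOException; }
interface EntryIndex { void update(long ledgerId, long entryId, long logId, long offset); }

class CompactionSketch {
    // Copy the live entries of a mostly-deleted entry log into the current
    // (newer) log and repoint the index; the old file can then be removed.
    static void compact(long oldLogId, EntryLogReader reader, EntryLogWriter currentLog,
                        EntryIndex index, Set<Long> activeLedgers) throws IOException {
        for (Entry e : reader.scan(oldLogId)) {
            if (!activeLedgers.contains(e.ledgerId())) {
                continue;                           // ledger was deleted: drop the entry
            }
            long offset = currentLog.append(e);     // re-append into the newer entry log
            index.update(e.ledgerId(), e.entryId(), currentLog.id(), offset);
        }
        // Once nothing references oldLogId anymore, delete the old entry log file.
    }
}
```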

Changes already done

8

▪ Writes interleaved in the entryLog make for poor read performance
  › Typically you want to read many entries sequentially
  › In SortedLedgerStorage (since BK-4.3) and in DbLedgerStorage (scheduled for BK-4.5), there's the concept of a write-cache (see the sketch below):
    • Defer the writing to the entryLog and sort by ledgerId/entryId to have entries stored sequentially
  › On the same note, using a read-ahead cache will amortize IO ops

▪ Use RocksDB to maintain indexes
  › In DbLedgerStorage we load all the offsets into RocksDB. This helps when storing many ledgers (tested with a few million) in a single bookie.
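A sketch of the write-cache idea referenced above: buffer entries in a map sorted by (ledgerId, entryId) so that a flush writes each ledger's entries contiguously. Illustrative only, not the actual SortedLedgerStorage/DbLedgerStorage code; it reuses the Entry/EntryLogWriter types from the compaction sketch above:

```java
import java.io.IOException;
import java.util.Comparator;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

class WriteCacheSketch {
    // Sorted map: iteration order is ledgerId first, then entryId.
    private final ConcurrentSkipListMap<long[], byte[]> cache =
            new ConcurrentSkipListMap<>(
                    Comparator.<long[]>comparingLong(k -> k[0]).thenComparingLong(k -> k[1]));

    void addEntry(long ledgerId, long entryId, byte[] payload) {
        cache.put(new long[] { ledgerId, entryId }, payload);
    }

    // Flush in sorted order: entries of the same ledger end up back-to-back in
    // the entry log, so sequential reads and the read-ahead cache get long runs.
    void flushTo(EntryLogWriter log) throws IOException {
        for (Map.Entry<long[], byte[]> e : cache.entrySet()) {
            log.append(new Entry(e.getKey()[0], e.getKey()[1], e.getValue()));
        }
        cache.clear();
    }
}
```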

Improvement areas / 1

9

▪ JVM GC still has an impact on latencies
  › Several improvements have already been made
  › GC cannot be avoided; going to 0 allocations per entry written is not practical
  › The only option is to make pauses as infrequent as possible
  › Single-bookie throughput is limited by GC rather than hardware:
  › Above a certain rate, the latency spikes from pauses would make it miss the SLA
  › Batching more logical entries into a single BK entry helps a lot, but it's not always practical


Improvement areas / 2

10

▪ Getting long sequences of sequential entries, and thus taking advantage of the read-ahead cache, depends on the flushInterval:
  › Frequent flushes will make for less contiguous entries
  › A longer interval means more long-lived Java objects and longer pauses

▪ Similarly, if writes are spread across many ledgers, with a very low per-ledger rate, there will be few sequential entries

▪ Bookie compaction
  › During compaction, older entries are re-appended and mixed with new entries
  › Long-lived entries will get compacted all over again
  › Need to keep EntryLogMetadata in memory (when storing 20TB that can be quite significant)

Considerations on Bookie storage

11

▪ Original BK implementation dates back to 2009
▪ Bookie storage really resembles an LSM DB
  › Journal → Write-Ahead Log
  › Entry Log → SSTs
  › Compaction
  › Write cache → MemTable
  › Read cache → LRU Block cache

▪ Why not directly store all the data in RocksDB?
▪ Can we get the same performance as the current Bookie?
▪ That would replace a large portion of Bookie code
▪ At that point, why not have the Bookie server in C++?


Bookie-CPP

12

▪ What is it?
  › A Proof of Concept to validate the performance assumptions
  › Compatible with the regular BK Java client
  › Async C++ server that writes into RocksDB
  › So far, only addEntry() is implemented

▪ What it is not
  › No plan to write a BK client in C++

▪ ¿¿Why??
  › Fully utilize IO capacity (vertical scalability)
  › Better compaction, no GC pauses, block-level compression, etc…


RocksDB tuning

13

▪ We can make RocksDB look like a Bookie
▪ Goals: high throughput and low latency for writes
  › Use a background thread to implement group-commit on top of RocksDB
  › To ensure writes are not stalled by compaction, use a large MemTable (write-cache) size: 4x 1GB
  › Use a big SST size: 1GB
  › Big block size: 256K (helps for HDDs)
  › Compaction read-ahead buffer: 8MB
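The PoC server is C++, but for illustration here is roughly how those knobs map onto the RocksDB Java binding (RocksJava); the path and sizes are just the values from the slide, and exact option names can vary between RocksDB versions:

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class BookieLikeRocksDb {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        Options options = new Options()
                .setCreateIfMissing(true)
                .setWriteBufferSize(1L * 1024 * 1024 * 1024)      // 1 GB MemTable (write-cache)
                .setMaxWriteBufferNumber(4)                       // up to 4 of them → "4x 1GB"
                .setTargetFileSizeBase(1L * 1024 * 1024 * 1024);  // ~1 GB SST files
        options.setCompactionReadaheadSize(8L * 1024 * 1024);     // 8 MB compaction read-ahead

        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
                .setBlockSize(256 * 1024);                        // 256 KB blocks, helps on HDDs
        options.setTableFormatConfig(tableConfig);

        // Group-commit would be layered on top: a background thread batching
        // many addEntry() payloads into a single RocksDB write (not shown).
        try (RocksDB db = RocksDB.open(options, "/tmp/bookie-rocksdb")) {
            // ... serve addEntry() / readEntry() against db ...
        }
    }
}
```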


How to implement deletion

14

▪ Bookie GC will still do the same scan & compare

▪ Typically, in LSM DBs a delete operation consists of writing a tombstone marker
  › Data is deleted when the tombstones are pushed to the last level and the SST is compacted

▪ RocksDB provides additional options to delete data:
  › DeleteFilesInRange() → immediately delete SSTs that only contain keys in that range
    • e.g.: DeleteFilesInRange( [ledgerId, ledgerId+1) )

› Compaction filter → hook into RocksDB compaction to decide which data needs to be kept when compacting. Can use the map of active ledgers to do it. Compaction can also be forced by calling CompactRange()
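A sketch of how a whole-ledger delete maps onto a key range, assuming entries are keyed by a 16-byte big-endian (ledgerId, entryId) so each ledger is contiguous in RocksDB's key space. The Java binding's deleteRange() is used here for illustration; the C++ PoC can apply DeleteFilesInRange() / CompactRange() to the same bounds to reclaim the space eagerly:

```java
import java.nio.ByteBuffer;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

class LedgerDeleter {
    // Big-endian (ledgerId, entryId) key: all entries of a ledger sort together.
    static byte[] key(long ledgerId, long entryId) {
        return ByteBuffer.allocate(16).putLong(ledgerId).putLong(entryId).array();
    }

    // Drop every key in [ (ledgerId, 0), (ledgerId + 1, 0) ), i.e. the whole ledger.
    static void deleteLedger(RocksDB db, long ledgerId) throws RocksDBException {
        db.deleteRange(key(ledgerId, 0), key(ledgerId + 1, 0));
    }
}
```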


Preliminary tests / 1

15

▪ 1 client - 1 bookie
▪ Bookie journal: SSD + RAID BBU
▪ Bookie ledgers: HDDs
▪ Writing 60K 1KB entries/s over multiple ledgers
▪ C++ perf tool that simulates the BK client and measures latency
  › Only sends addEntry requests / no actual ledger metadata in ZK
  › Using a C++ client removes JVM GC measurement noise on the client side

▪ Measure 99pct write latency over different time intervals
  › 1min, 10sec, 1sec


Preliminary tests / 2

16

Preliminary tests / 3

17

Preliminary tests / 4

18

Conclusions

19

▪ Preliminary results look promising
▪ Work in progress, code at github.com/merlimat/bookie-cpp
▪ Feedback welcome
▪ Hopefully there's interest in this area
▪ It would be great to include it in the main BK repository at some point
