Tunable Memory in Couchbase Server 3.0: Couchbase Connect 2014
Tunable Memory in Couchbase Server 3.0
Chiyoung Seo, Software Architect, Couchbase Inc.
©2014 Couchbase, Inc. 2
Contents
Data Manager in Couchbase Server
Database Bucket Architecture
NRU-Based Cache Management
Value-Only Ejection
Cache Management in Couchbase Server 3.0
Full Metadata Ejection
Performance Impact
Future Work for Performance Enhancements
Summary
Database Bucket Architecture
Couchbase Cluster
[Diagram: a cluster of five Couchbase Server nodes, each running a Cluster Manager and a Data Manager]
Couchbase Server Architecture
[Diagram: the Data Manager and the Cluster Manager inside a Couchbase Server node]
Data Manager
Query API (port 8092), Query Engine
Moxi (port 11211, Memcapable 1.0)
Memcached
Couchbase EP Engine (port 11210, Memcapable 2.0)
storage interface
Persistence Layer
Cluster Manager (Erlang/OTP)
HTTP REST management API / Web UI (HTTP, port 8091)
Erlang port mapper (port 4369)
Distributed Erlang (ports 21100 - 21199)
Components, labeled “on each node” or “one per cluster”: Heartbeat, Process monitor, Global singleton supervisor, Configuration manager, Rebalance orchestrator, Node health monitor, vBucket state and replication manager
Data Manager Architecture
[Diagram components: Memcached (port 11210), Bucket Engine, Database Bucket (several instances), storage interface, Storage Engine, Shared Thread Pool]
Database Bucket Architecture
[Diagram components]
Engine APIs (get, set, del, add, append, DCP, …)
Partition Hash Tables (active, replica, active, …), each shown with its Checkpoints
User Configured Replica Count = 1
Checkpoint Manager, Item Pager, Expiry Pager
Shared Thread Pool: Reader Threads, Writer Threads, Non-IO Threads, Aux-IO Threads
Tasks shown in the thread pool: Batch Readers, Flushers, Data Replicator, Data Backfill, I/O Completion Notifier, …
Append-only B-Tree Storage Engine
Partition Hash Table
Hash buckets
Each hash bucket is maintained as a linked list of items
Engine parameter “ht_size” sets the initial number of hash buckets
Multiple locks synchronize access to the hash buckets
Engine parameter “ht_locks” sets the number of locks
Hash buckets are dynamically resized by the daemon task “hash table resizer”
A Non-IO thread runs the hash table resizer task periodically
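The hash bucket and lock layout described on this slide can be sketched roughly as follows. This is a minimal illustration, not the ep-engine implementation: the class and method names are hypothetical, Python lists stand in for the linked lists, `zlib.crc32` is an arbitrary choice of hash function, and only the parameter names `ht_size` and `ht_locks` come from the slide.

```python
import threading
import zlib


class PartitionHashTable:
    """Chained hash table guarded by a fixed pool of striped locks (illustrative)."""

    def __init__(self, ht_size=8, ht_locks=4):
        # Each hash bucket is a chain of [key, value] entries.
        self.buckets = [[] for _ in range(ht_size)]
        self.locks = [threading.Lock() for _ in range(ht_locks)]

    def _hash(self, key):
        return zlib.crc32(key.encode("utf-8"))

    def _lock_for(self, key):
        # The lock depends only on the key hash, so it stays valid across a resize.
        return self.locks[self._hash(key) % len(self.locks)]

    def set(self, key, value):
        with self._lock_for(key):
            chain = self.buckets[self._hash(key) % len(self.buckets)]
            for entry in chain:
                if entry[0] == key:
                    entry[1] = value
                    return
            chain.append([key, value])

    def get(self, key):
        with self._lock_for(key):
            chain = self.buckets[self._hash(key) % len(self.buckets)]
            for k, v in chain:
                if k == key:
                    return v
            return None

    def resize(self, new_size):
        # Stand-in for the "hash table resizer" task: take every lock, then rehash.
        for lock in self.locks:
            lock.acquire()
        try:
            new_buckets = [[] for _ in range(new_size)]
            for chain in self.buckets:
                for entry in chain:
                    new_buckets[self._hash(entry[0]) % new_size].append(entry)
            self.buckets = new_buckets
        finally:
            for lock in self.locks:
                lock.release()
```

With far fewer locks than hash buckets, several buckets share one lock, which keeps lock memory small while still letting operations on different keys proceed in parallel most of the time.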
Partition Hash Table
[Diagram: HashBucket 1 … HashBucket n, each a linked list of items of the form {Key: “K1”, Metadata: exp, cas, NRU, …, Value: “V1”}; Lock 1 … Lock m guard the hash buckets]
NRU-Based Cache Management
Not Recently Used (NRU) Score
Maintain a two-bit NRU score per item in the hash table
The NRU score is set to “2” for each new item and is decremented by “1” on each READ access
NRU Score    Access Pattern
3 → 2        Accessed by READ
2 → 3        Incremented by Item Pager
2            Initial value for a new item
2 → 1        Accessed by READ
1 → 2        Incremented by Item Pager
1 → 0        Accessed by READ
0 → 1        Incremented by Item Pager
0 → 0        Accessed by READ
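The score transitions in the table can be written as two small saturating functions. This is only a sketch: the function names are hypothetical, and clamping the pager increment at 3 is an assumption (the table does not show a transition out of 3 other than a READ).

```python
MIN_NRU, INITIAL_NRU, MAX_NRU = 0, 2, 3  # a two-bit score covers 0..3


def on_read(score):
    """A READ access decrements the score and stops at 0."""
    return max(MIN_NRU, score - 1)


def on_pager_visit(score):
    """An item pager visit increments the score (assumed to stop at 3)."""
    return min(MAX_NRU, score + 1)
```

A lower score therefore means "more recently used": frequently read items drift toward 0, while items that are never read drift toward 3 and become ejection candidates.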
Item Pager
Daemon task that is responsible for ejecting non-dirty items from the hash table
Runs when the database bucket memory usage goes above the high watermark, and ejects items until the memory usage drops below the low watermark
Bucket memory configuration
Memory quota
Memory high watermark (85%)
Memory low watermark (75%)
Item Pager
Phase I
1. Scan the next partition hash table and collect items with NRU score ‘3’
2. Eject items with NRU score ‘3’
3. Go to Step 1 if the memory usage is still above the low watermark
Phase II
1. Scan the next partition hash table and increment each item’s NRU score by ‘1’
2. If an item’s NRU score becomes ‘3’, then eject the item if N < P
N is a randomly generated number in the range [0, 1]
P is a probability based on the current memory usage, the low watermark, and the partition state (active vs. replica)
3. Go to Step 1 if the memory usage is still above the low watermark
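The two phases can be sketched as a single pass over the partitions. Everything here is illustrative: partitions are plain dictionaries, items are dictionaries with `nru`, `dirty`, and `size` fields, and `eject_probability` is a made-up formula, since the slides only say that P depends on memory usage, the low watermark, and the partition state. For simplicity an ejected item is removed entirely.

```python
import random


def memory_used(partitions):
    return sum(item["size"] for table in partitions for item in table.values())


def eject_probability(used, low_wm, is_replica):
    # Hypothetical formula (the slides do not give one): the further above the
    # low watermark, the likelier an ejection; replicas are ejected more eagerly.
    p = (used - low_wm) / used if used > low_wm else 0.0
    return min(1.0, p * (2.0 if is_replica else 1.0))


def run_item_pager(partitions, replica_flags, low_wm, rng=random.random):
    # Phase I: eject non-dirty items whose NRU score is already 3.
    for table in partitions:
        if memory_used(partitions) <= low_wm:
            return
        victims = [k for k, it in table.items() if it["nru"] == 3 and not it["dirty"]]
        for key in victims:
            del table[key]
    # Phase II: age every item; probabilistically eject those that reach 3.
    for table, is_replica in zip(partitions, replica_flags):
        used = memory_used(partitions)
        if used <= low_wm:
            return
        p = eject_probability(used, low_wm, is_replica)
        for key in list(table):
            item = table[key]
            item["nru"] = min(3, item["nru"] + 1)
            if item["nru"] == 3 and not item["dirty"] and rng() < p:
                del table[key]
```

Passing `rng` as a parameter makes the probabilistic step easy to test with a fixed value.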
Access Log Generator
Periodic daemon task scheduled once per day (10AM UTC by default)
1. Scans each partition hash table to gather the list of currently resident items
2. Writes {key, metadata} of those resident items into the access log file
The access log is used to restore the working set that was resident in memory before a node restart or crash
[Diagram: the partition hash tables (active and replica) are grouped into Shard 1 … Shard n; each shard is shown with an Access Log Generator and a Warm-up Task]
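A rough sketch of the access log generator's two steps follows. The on-disk format used here (one JSON record per line) is purely an assumption for illustration, as are the function names and the convention that an item whose `value` is `None` is not resident.

```python
import json


def write_access_log(hash_tables, path):
    """Write {key, metadata} of every resident item, one JSON record per line."""
    count = 0
    with open(path, "w", encoding="utf-8") as log:
        for table in hash_tables:
            for key, item in table.items():
                if item["value"] is not None:  # resident: value is still in memory
                    log.write(json.dumps({"key": key, "meta": item["meta"]}) + "\n")
                    count += 1
    return count


def read_access_log(path):
    """Return the keys recorded in the log, in the order they were written."""
    with open(path, encoding="utf-8") as log:
        return [json.loads(line)["key"] for line in log]
```

A warm-up task could then call `read_access_log` after a restart to decide which values to load first.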
Cache Management: Value-Only Ejection
Hash Table Item
Each hash table item consists of {key, metadata, value}
Metadata memory overhead is at least 40 bytes
[Diagram: a Hash Table Item holds the Key, the Metadata, a Blob pointer to the Blob value, and a Pointer to next item]
Metadata fields
Expiration time
CAS identifier
Sequence number (DCP)
Revision number (XDCR)
Lock expiry (GetLocked API)
Flag, NRU, …
Value-Only Ejection
Application’s entire key space is maintained in the hash table
Highly cache-oriented architecture
Item pager ejects only an item’s value from the hash table
[Diagram: the Hash Table Item for Key “foo” keeps its Metadata, Blob pointer, and Pointer to next item; for Get(“foo”), the Batch Reader issues read_value(“foo”) to the Storage Engine to bring back the Blob value]
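Value-only ejection can be sketched with a small in-memory model. The class, its fields, and the dictionary standing in for the storage engine are all hypothetical; the point is only that the key and metadata never leave the table, so a lookup miss needs no disk access.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    key: str
    cas: int
    value: Optional[bytes]  # None means the value was ejected


class ValueOnlyBucket:
    def __init__(self, disk):
        self.disk = disk   # stand-in for the storage engine: key -> value
        self.table = {}    # every key and its metadata stay here

    def eject(self, key):
        self.table[key].value = None  # drop the value, keep key + metadata

    def exists(self, key):
        return key in self.table      # answered from memory alone

    def get(self, key):
        item = self.table.get(key)
        if item is None:
            return None               # the full key space is in memory: a miss is final
        if item.value is None:
            item.value = self.disk[key]  # plays the role of read_value(key)
        return item.value
```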
Value-Only Ejection
Pros
Maximize the memory utilization
High performance (latency, throughput)
Create, Read, Update, Delete operations
Key existence check
Cons
High memory overhead due to (key + metadata) of non-resident items in cache
Slow system warm-up time because at least all the keys and their metadata values must be loaded
Couchbase Server 3.0: Full Metadata Ejection
Full Metadata Ejection
Application’s entire key space doesn’t need to be loaded in cache
Reduce the memory overhead significantly in heavy DGM (Disk Greater than Memory) cases
Item pager ejects an item’s key and metadata along with its value
[Diagram: the whole Hash Table Item for Key “foo” (Metadata, Blob pointer, Blob value, Pointer to next item) is ejected; for Get(“foo”), the Batch Reader issues read_meta_value(“foo”) to the Storage Engine]
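Under full metadata ejection, a key that is absent from the hash table may still exist on disk, so a lookup miss has to go to the storage engine. The sketch below models this with hypothetical names and a dictionary in place of the storage engine; the `disk_reads` counter is only there to make the extra I/O visible.

```python
class FullEjectionBucket:
    def __init__(self, disk):
        self.disk = disk      # stand-in for the storage engine: key -> (meta, value)
        self.table = {}       # holds only resident items
        self.disk_reads = 0

    def eject(self, key):
        self.table.pop(key, None)   # key, metadata, and value all leave memory

    def get(self, key):
        if key in self.table:
            return self.table[key][1]
        # Not resident: only the storage engine knows whether the key exists.
        self.disk_reads += 1        # plays the role of read_meta_value(key)
        record = self.disk.get(key)
        if record is None:
            return None
        self.table[key] = record
        return record[1]
```

Note that a lookup for a key that does not exist at all still costs a disk read in this model.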
Implementation Impacts on APIs
Many read/write APIs require an item’s metadata to be resident in cache
CAS (Compare and Set)
Add
Delete
Touch
GetMetaData
SetWithMeta
DeleteWithMeta
Implementation Impacts on APIs
CAS (Compare and Set) API
CAS operation needs to compare the item’s CAS identifier sent by the client with the one on the server side
Succeed only if those CAS identifiers are still the same
[Diagram: the client issues CAS(“foo”, 100, “value2”); the Storage Engine holds {“foo”, 100, “value1”}; the Batch Reader issues read_metadata(“foo”); the Hash Table Item for Key “foo” is shown with Metadata CAS id: 100 and Blob value: “value2”]
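A CAS operation on a non-resident item in full ejection mode might look roughly like this. The function, the dictionary layouts, and the way a new CAS identifier is produced (incrementing the old one) are assumptions made for the sketch; the slides only state that the identifiers are compared and that the metadata must first be read from disk.

```python
def cas_update(table, disk, key, client_cas, new_value):
    """Return True if the value was replaced, False on a mismatch or missing key."""
    item = table.get(key)
    if item is None:
        meta = disk.get(key)          # plays the role of read_metadata(key)
        if meta is None:
            return False              # the key does not exist anywhere
        item = table[key] = {"cas": meta["cas"], "value": None}
    if item["cas"] != client_cas:
        return False                  # another writer updated the item first
    item["value"] = new_value
    item["cas"] += 1                  # assumption: any fresh identifier would do
    return True
```

A second call with the old identifier fails, which is what protects a client from overwriting a concurrent update.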
Implementation Impacts on APIs
Add API
Succeed only if an item is already expired or doesn’t exist
If an item is not resident in the full ejection mode, a disk lookup is required to determine whether the item exists
Delete API
Succeed only if an item exists and is not deleted yet
Disk lookup is required for a non-resident item in the full ejection mode
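The Add and Delete rules can be sketched as below for the full ejection mode. All names and the item layout are hypothetical, and treating `exp == 0` as "never expires" is an assumption borrowed from memcached conventions rather than something stated on the slide.

```python
import time


def _lookup(table, disk, key):
    """Return the item, fetching it from disk first if it is not resident."""
    if key not in table and key in disk:
        table[key] = dict(disk[key])      # disk lookup for a non-resident item
    return table.get(key)


def _is_expired(item, now):
    return item["exp"] != 0 and item["exp"] <= now   # exp == 0: assumed "no expiry"


def add(table, disk, key, value, now=None):
    now = time.time() if now is None else now
    item = _lookup(table, disk, key)
    if item is not None and not item["deleted"] and not _is_expired(item, now):
        return False                      # a live item already exists
    table[key] = {"value": value, "exp": 0, "deleted": False}
    return True


def delete(table, disk, key):
    item = _lookup(table, disk, key)
    if item is None or item["deleted"]:
        return False                      # nothing to delete
    item["deleted"] = True
    return True
```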
Deletion of Expired Items
Value-Only Ejection Mode
[Diagram components: Expiry Pager, Hash Tables, Checkpoint Queue]
Full Metadata Ejection Mode
[Diagram components: Expiry Pager, Hash Tables, Checkpoint Queue, DB Compactor, Storage Engine]
Full Metadata Ejection: Performance Impact
Performance Impacts on APIs
More disk I/O overhead for non-resident items
CAS (Compare and Set)
Add
Delete
Touch
GetMetaData
SetWithMeta
DeleteWithMeta
If an application’s active working set doesn’t fit into the bucket memory quota, it will experience higher latency
Performance Impacts on Warm-up
Value-Only Ejection Mode
1. Load all the keys and their metadata values into memory
2. Load the access log? (Yes / No)
3. If Yes: read the access log file, then load the values of keys that are read from the access log file
4. Warm-up completed
Full Metadata Ejection Mode
1. Read the access log file
2. Load the values of keys that are read from the access log file
3. Warm-up completed
Full ejection mode provides much faster system warm-up
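The difference between the two warm-up paths can be illustrated with a small model. The function and data layout are hypothetical, and the behavior when no access log is available (no values are loaded) is an assumption, since the flowchart does not survive in enough detail to confirm it.

```python
def warm_up(disk, access_log_keys, full_ejection):
    """Return the in-memory table after warm-up; a value of None is not resident.

    disk maps key -> (metadata, value); access_log_keys is a list of keys,
    or None when no access log is loaded.
    """
    table = {}
    if not full_ejection:
        # Value-only mode: every key and its metadata must be loaded first.
        for key, (meta, _value) in disk.items():
            table[key] = (meta, None)
    if access_log_keys is not None:
        # Both modes: load values only for keys recorded in the access log.
        for key in access_log_keys:
            if key in disk:
                table[key] = disk[key]
    return table
```

In this model the value-only path touches every key on disk, while the full ejection path touches only the keys named in the access log, which is why its warm-up is faster.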
Value-Only vs. Full Metadata Ejection
Value-Only Ejection
Active working set fits into the bucket memory quota
Active working set changes fast over time
Light to medium DGM (Disk Greater than Memory) cases (resident ratio >= 20%)
High performance is more crucial
Full Metadata Ejection
Active working set doesn’t fit into the bucket memory quota
Active working set changes slowly over time
Heavy DGM cases (resident ratio <= 10%) with a huge data set
Application doesn’t require high performance comparable to the value-only ejection mode
Future Work for Performance Enhancements
New APIs
Some APIs can be easily extended to support better working set management or an asynchronous option to unblock the client
Async Add
Async Delete
Async Get
Get_Cached
SetWithoutCaching
…
Bloom Filter
Probabilistic data structure that can tell us if an item is a member of a set
False positives are possible, but false negatives are not
Increasing the filter size reduces the false positive rate at the expense of additional memory overhead
Various hash algorithms can be used
MurmurHash
CityHash
Jenkins Hash
Reduce the disk I/O lookup overhead for non-existent items
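A minimal Bloom filter can be sketched in a few lines. This example uses salted BLAKE2b from Python's standard library in place of the hash functions named on the slide (MurmurHash, CityHash, Jenkins Hash), purely to stay dependency-free; the class name, method names, and default sizes are illustrative assumptions.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive one bit position per hash by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(key.encode("utf-8"), salt=bytes([i])).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```

When `might_contain` returns False for a key, a disk lookup for that key can be skipped, which is the saving the slide refers to for non-existent items.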
Integrating Bloom Filter with Couchbase Server
[Diagram components: Item Pager, Hash Tables, Non-resident items, Bloom Filter per partition, Storage Engine, Compactor]
Compactor: resizing the bloom filter during the compaction
Summary
More flexible cache management is necessary
Driven by heavy DGM requirements in big data applications
Full metadata ejection
Large data set support without significant memory overhead
Performance impacts – not comparable to value-only ejection
Plan to improve the performance
API extensions
Bloom filter integration
New storage engine
Questions?