![Page 1: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/1.jpg)
The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production
Sudipta Sengupta
Joint work with Justin Levandoski and David Lomet (Microsoft Research)
And Microsoft Product Group Partners across SQL Server, Azure DocumentDB, and Bing ObjectStore
![Page 2: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/2.jpg)
The B-Tree
• Key-ordered access to records
• Separator keys in internal nodes (to guide search) and full records in leaf nodes
• Efficient point and range lookups
• Balanced tree via page split and merge mechanisms
[Figure: B-tree diagram with regions labeled "In Memory" and "On Disk"; leaf nodes hold data]
![Page 3: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/3.jpg)
Design Tenets for A New B-Tree
• Lock-free operations for high concurrency
  – Exploit modern multi-core processors
• Log-structured storage organization
  – Exploit fast random read property of flash and work around inefficient random writes
• Delta updates to pages
  – Reduces cache invalidation in memory hierarchy
  – Reduces garbage creation and write amplification on flash, increases device lifetime
![Page 4: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/4.jpg)
Bw-Tree Architecture
B-Tree Layer (Access Method)
• Expose API
• B-tree search/update logic
• In-memory pages only
Cache Layer (LLAMA)
• Logical page abstraction for B-tree layer
• Moves pages between memory and flash as necessary
Flash Layer (LLAMA)
• Reads/Writes from/to storage
• Storage management
![Page 5: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/5.jpg)
The Mapping Table
• Expose logical pages to the access method layer
  – Translates logical page ID to physical address
• Helps to isolate updates to a single page
• Central data structure for multi-threaded concurrency control
• Also used for log-structured store mapping
• Updated in a lock-free manner [using compare-and-swap (CAS)]
[Figure: Mapping table with columns Page ID and Physical Address; entries point to pages in RAM or on flash. Each entry is 64 bits: a 1-bit flash/mem flag and a 63-bit address.]
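A minimal sketch of the mapping table idea, assuming a 64-bit entry whose top bit is the flash/memory flag (the polarity is an assumption; the slide only gives the 1-bit/63-bit split). The names `encode`, `decode`, and `MappingTable` are illustrative and not taken from the Bw-tree code. Python has no hardware CAS, so a lock stands in for the atomic instruction here.

```python
import threading

FLASH_FLAG = 1 << 63          # top bit: 1 = flash offset, 0 = RAM address (assumed polarity)
ADDR_MASK = FLASH_FLAG - 1    # low 63 bits hold the address


def encode(address, on_flash):
    """Pack a 63-bit address and the flash/memory flag into one 64-bit word."""
    if not 0 <= address <= ADDR_MASK:
        raise ValueError("address must fit in 63 bits")
    return (FLASH_FLAG if on_flash else 0) | address


def decode(word):
    """Return (address, on_flash) for a 64-bit mapping table entry."""
    return word & ADDR_MASK, bool(word & FLASH_FLAG)


class MappingTable:
    """Maps logical page IDs to 64-bit entries.

    The lock only emulates atomicity; a real implementation would issue
    a single compare-and-swap instruction on the slot.
    """

    def __init__(self, size):
        self._slots = [0] * size
        self._lock = threading.Lock()

    def get(self, pid):
        return self._slots[pid]

    def compare_and_swap(self, pid, expected, new):
        """Install `new` only if the slot still holds `expected`."""
        with self._lock:
            if self._slots[pid] != expected:
                return False
            self._slots[pid] = new
            return True


table = MappingTable(8)
old = table.get(3)
first = table.compare_and_swap(3, old, encode(0x1000, on_flash=False))
# A second writer still holding the stale value loses the race and must retry.
second = table.compare_and_swap(3, old, encode(0x2000, on_flash=True))
```

Because every update to a page goes through one slot, a failed `compare_and_swap` is the only signal a writer needs to detect a concurrent change and retry.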
![Page 6: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/6.jpg)
[Figure: Delta updates to page P. The mapping table entry for P (Page ID → Physical Address) points to a chain of delta records on top of page P: Δ: Insert record 50, Δ: Delete record 48, Δ: Update record 35, Δ: Insert record 60. The chain is later replaced by a consolidated page P.]
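The delta chain in this figure can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the production code: `BasePage`, `Delta`, `prepend`, `lookup`, and `consolidate` are hypothetical names, records are (key, value) pairs, and the step that installs a new chain head in the mapping table with CAS is left out.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class BasePage:
    records: tuple  # sorted tuple of (key, value) pairs


@dataclass(frozen=True)
class Delta:
    op: str                          # "insert", "update", or "delete"
    key: int
    value: Optional[str]
    next: Union["Delta", BasePage]   # older part of the chain


def prepend(head, op, key, value=None):
    """Return a new chain head; the existing chain is never modified."""
    return Delta(op, key, value, head)


def lookup(head, key):
    """Walk deltas from newest to oldest; the first match wins."""
    node = head
    while isinstance(node, Delta):
        if node.key == key:
            return None if node.op == "delete" else node.value
        node = node.next
    return dict(node.records).get(key)


def consolidate(head):
    """Fold the delta chain into a fresh base page."""
    deltas = []
    node = head
    while isinstance(node, Delta):
        deltas.append(node)
        node = node.next
    records = dict(node.records)
    for d in reversed(deltas):       # apply the oldest delta first
        if d.op == "delete":
            records.pop(d.key, None)
        else:
            records[d.key] = d.value
    return BasePage(tuple(sorted(records.items())))


page_p = BasePage(((35, "a"), (48, "b")))
head = prepend(page_p, "insert", 50, "c")
head = prepend(head, "delete", 48)
head = prepend(head, "update", 35, "a2")
head = prepend(head, "insert", 60, "d")
consolidated = consolidate(head)
```

Since the old chain is immutable, readers that started before a consolidation can keep traversing it safely while the new page is swapped in.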
![Page 7: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/7.jpg)
[Figure: Page split. Mapping table (PID → Physical Address) with entries for pages 2 and 4; Pages 1, 2, 3 and new Page 4; a Split Δ and an Index Entry Δ; legend: logical pointer, physical pointer.]
![Page 8: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/8.jpg)
Flash SSDs: Log-Structured Storage
[Figure: IOPS on a FusionIO 160GB ioDrive: seq-reads 134725, rand-reads 134723, seq-writes 49059, rand-writes 17492; the write bars are annotated "3x"]
Use flash in a log-structured manner
![Page 9: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/9.jpg)
LLAMA Log-Structured Store
• Suitable for flash + other benefits
• Amortize cost of writes over many page updates
  – Aggregate large amounts of new/changed data and append to the log in a single I/O
• Multiple random reads to fetch a “logical page”
  – Okay for flash, on the order of a few tens of microseconds
• Works well for hard disks also
  – Benefit of amortizing page write cost
  – Random reads incur seek latency, but this is mitigated by capturing the working set of pages in RAM
[Figure: B-Tree Layer / Cache Layer / Storage Layer (disk or flash or other NVM)]
![Page 10: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/10.jpg)
[Figure: Mapping table in RAM pointing into a sequential log on flash memory; the log holds base pages and Δ-records in write order]
Departure from Tradition: Page Layout on Flash
• Logical pages are formed by linking together records on possibly different physical pages
  – Logical pages do not correspond to whole physical pages on flash
  – Physical pages on flash contain records from multiple logical pages
• Exploits random access nature of flash media
  – No disk-like seek overhead in reading records in a logical page spread across multiple physical pages on flash
• Adapted from SkimpyStash (ACM SIGMOD 2011)
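As a rough illustration of this layout, the sketch below stages writes in a buffer, appends them to the log in one flush, and links each delta record back to the previous record of the same logical page. `LogStore` and its methods are hypothetical names; a list index stands in for a flash offset, and each `flush` call is counted as one I/O.

```python
class LogStore:
    """Append-only log; each record carries a back pointer to the
    previous record of the same logical page (None for a base page)."""

    def __init__(self):
        self.log = []        # list index stands in for a flash offset
        self.mapping = {}    # page ID -> offset of the newest record
        self.buffer = []     # staged (pid, kind, payload) writes
        self.io_count = 0    # number of sequential appends issued

    def write(self, pid, kind, payload):
        """Stage a base page or delta record in the flush buffer."""
        self.buffer.append((pid, kind, payload))

    def flush(self):
        """Append everything staged so far in one sequential I/O."""
        if not self.buffer:
            return
        for pid, kind, payload in self.buffer:
            prev = None if kind == "base" else self.mapping.get(pid)
            self.log.append((pid, kind, payload, prev))
            self.mapping[pid] = len(self.log) - 1
        self.buffer.clear()
        self.io_count += 1

    def read_page(self, pid):
        """Follow back pointers from the newest record to the base page."""
        chain, offset = [], self.mapping.get(pid)
        while offset is not None:
            _, kind, payload, prev = self.log[offset]
            chain.append((kind, payload))
            offset = prev
        return chain


store = LogStore()
store.write("P", "base", [1, 2])
store.write("Q", "base", [9])
store.write("P", "delta", "+3")
store.flush()                      # three records, one I/O
store.write("P", "delta", "-1")
store.flush()
page_p_chain = store.read_page("P")
```

Reading page P touches offsets 3, 2, and 0, with a record of Q in between; on flash those scattered reads carry no seek penalty.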
![Page 12: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/12.jpg)
Log-structured Store on SSD
[Figure: Base pages and Δ-records in RAM are written through a latch-free flush buffer (8MB) to the log on disk in write order; the mapping table points to records in RAM and in the log]
![Page 13: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/13.jpg)
• Reading a “logical” page may involve reading delta records from multiple physical pages
– Probably okay because of fast random access property of flash
– Mitigated by capturing working set of pages in memory
• But we can reduce read I/Os further
– Multiple delta records, when flushed together, are packed into a contiguous unit on flash (C-delta)
– Pages consolidated periodically in memory also get consolidated on flash when they are flushed
[Figure: On flash, a base page followed by multiple delta records written together (C-delta)]
![Page 14: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/14.jpg)
• Two types of record units in the log
– Valid – Reachable from the flash offset in the mapping table
– Orphaned – not reachable
• Garbage collection starts from oldest portion of log
– Earliest written record (base page) on a “logical” page is encountered first
– Avoid cascaded pointer updates up the chain => relocate entire logical page at a time, use this opportunity to consolidate
[Figure: Flash log in write order, with the mapping table pointing to base pages in the log; GC point at the oldest end, write point at the newest end]
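The garbage collection rule above can be sketched as follows, assuming log records of the form `(pid, kind, payload, prev)` where `prev` is the offset of the previous record of the same logical page. The function names and the `"add"`/`"remove"` delta kinds are assumptions made for the example.

```python
def chain_offsets(log, mapping, pid):
    """Offsets of all records reachable from the mapping table entry."""
    offsets, off = [], mapping.get(pid)
    while off is not None:
        offsets.append(off)
        off = log[off][3]
    return offsets


def collect_oldest(log, mapping, gc_point):
    """Reclaim the record at the GC point and return the new GC point.

    A valid record causes its whole logical page to be consolidated and
    re-appended at the write point, so no pointer inside the chain needs
    patching; the old records of that page then become orphans.
    """
    pid = log[gc_point][0]
    offsets = chain_offsets(log, mapping, pid)
    if gc_point in offsets:
        records = set()
        for off in reversed(offsets):          # oldest record first
            _, kind, payload, _ = log[off]
            if kind == "base":
                records = set(payload)
            elif kind == "add":
                records.add(payload)
            elif kind == "remove":
                records.discard(payload)
        log.append((pid, "base", sorted(records), None))
        mapping[pid] = len(log) - 1
    log[gc_point] = None                       # space is now reclaimable
    return gc_point + 1


log = [
    ("P", "base", [1, 2], None),   # offset 0: valid, oldest record of P
    ("Q", "base", [7], None),      # offset 1: orphaned by offset 3
    ("P", "add", 3, 0),            # offset 2: valid delta on P
    ("Q", "base", [7, 8], None),   # offset 3: current version of Q
]
mapping = {"P": 2, "Q": 3}
gc_point = 0
for _ in range(3):
    gc_point = collect_oldest(log, mapping, gc_point)
```

After three steps, page P lives in a single consolidated record at the write point and the first three offsets are free.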
![Page 15: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/15.jpg)
LLAMA: Cache Layer
• Provide abstraction of logical pages to access method layer
  – Mapping table containing RAM pointers or flash offsets
• Read pages into RAM from stable storage
• Flush pages to stable storage
  – Writes to flash ordered through flush buffers
• Swap out pages to reduce memory usage
[Figure: B-Tree Layer / Cache Layer / Storage Layer (disk or flash or other NVM)]
![Page 16: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/16.jpg)
LLAMA: Page Swapout
• Attempt to swap out pages when memory usage exceeds a configurable threshold
• Uses a variant of the CLOCK algorithm
• Parallel page swapping functionality
  – Each accessor to Bw-Tree does a small amount of page swapping work (“CLOCK sweep”) if needed
• RAM pointer replaced by flash offset in mapping table
• Page structure deallocated using epoch-based memory garbage collection
[Figure: Mapping table entries pointing to pages in RAM and to pages on flash]
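A toy version of the swapout path, written as a plain CLOCK sweep rather than the variant the slide refers to (whose details are not given). `PageCache` and its fields are hypothetical; every in-memory page is assumed to already have a copy on flash, and the `retired` list only stands in for epoch-based reclamation.

```python
class PageCache:
    """Toy cache: mapping table entries are ["ram", page] or ["flash", offset]."""

    def __init__(self, limit):
        self.limit = limit     # max resident pages before swapout starts
        self.entries = {}      # page ID -> mapping table entry
        self.ref = {}          # page ID -> CLOCK reference bit
        self.hand = 0          # CLOCK hand position
        self.retired = []      # page structures awaiting epoch-based reclamation

    def install(self, pid, records, flash_offset):
        """Add an in-memory page that already has a copy on flash."""
        page = {"records": records, "flash_offset": flash_offset}
        self.entries[pid] = ["ram", page]
        self.ref[pid] = False

    def resident(self):
        return [pid for pid, entry in self.entries.items() if entry[0] == "ram"]

    def access(self, pid, sweep_steps=2):
        """Touch a page; the caller also does a bounded amount of sweep work."""
        self.ref[pid] = True
        if len(self.resident()) > self.limit:
            self.clock_sweep(sweep_steps)
        return self.entries[pid]

    def clock_sweep(self, steps):
        pids = list(self.entries)
        for _ in range(steps):
            if len(self.resident()) <= self.limit:
                return
            pid = pids[self.hand % len(pids)]
            self.hand += 1
            entry = self.entries[pid]
            if entry[0] != "ram":
                continue
            if self.ref[pid]:
                self.ref[pid] = False      # recently used: give a second chance
            else:
                page = entry[1]
                # Replace the RAM pointer with the page's flash offset.
                self.entries[pid] = ["flash", page["flash_offset"]]
                self.retired.append(page)  # freed later, once no reader can see it


cache = PageCache(limit=2)
cache.install("P0", [1, 2], flash_offset=100)
cache.install("P1", [3, 4], flash_offset=200)
cache.install("P2", [5, 6], flash_offset=300)
cache.access("P2")   # over the limit, so this accessor swaps out P0
```

Spreading the sweep across accessors means no dedicated eviction thread is needed in this sketch; each caller pays a small, bounded cost.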
![Page 17: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/17.jpg)
Bw-Tree/LLAMA Checkpointing
• B-Tree layer checkpointing (for durability)
  – Flush pages to flush buffer and subsequently to storage
• LLAMA checkpointing (for fast recovery)
  – Write the mapping table to flash
    -> When an entry contains a RAM address, obtain the flash address from the in-memory page
    -> Unused entries are written as zeroes
  – Record write position in log when the checkpoint started
  – Alternate between two fixed regions on flash for each checkpoint
[Figure: Mapping table (Page ID → Flash Offset) alongside the flash log, with the GC point and RSP marked]
![Page 18: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/18.jpg)
Bw-Tree Fast Recovery
• Restore mapping table from latest checkpoint region
• Scan from log position recorded in checkpoint to end of log
  – Read page ID from C-delta on log and update flash offset in mapping table
• Restore Bw-tree root page LPID
• Optimizations for fast cache warm-up
[Figure: Mapping table (MTable) in RAM pointing into the sequential log on flash memory, which holds base pages and Δ-records]
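The checkpointing and fast recovery steps can be sketched together. This is a simplified model under stated assumptions: the mapping table here holds flash offsets only (the RAM-address case is omitted), each log record is a `(pid, payload)` pair, and `Flash`, `append`, `checkpoint`, and `recover` are illustrative names.

```python
class Flash:
    def __init__(self):
        self.log = []                 # (pid, payload) records; index = offset
        self.regions = [None, None]   # two fixed checkpoint regions


def append(flash, mapping, pid, payload):
    """Append a record and point the mapping table at it."""
    flash.log.append((pid, payload))
    mapping[pid] = len(flash.log) - 1


def checkpoint(flash, mapping, seq):
    """Write the mapping table and the log position at checkpoint start.

    Alternating regions keeps the previous checkpoint intact while the
    new one is being written.
    """
    start = len(flash.log)
    flash.regions[seq % 2] = {"seq": seq, "start": start, "table": dict(mapping)}


def recover(flash):
    """Rebuild the mapping table from the latest checkpoint plus the log tail."""
    done = [r for r in flash.regions if r is not None]
    if done:
        latest = max(done, key=lambda r: r["seq"])
        mapping, start = dict(latest["table"]), latest["start"]
    else:
        mapping, start = {}, 0
    for offset in range(start, len(flash.log)):
        pid, _ = flash.log[offset]
        mapping[pid] = offset         # later records supersede earlier ones
    return mapping


flash, mapping = Flash(), {}
append(flash, mapping, "A", "base")
append(flash, mapping, "B", "base")
checkpoint(flash, mapping, seq=1)
append(flash, mapping, "A", "delta")
append(flash, mapping, "C", "base")
recovered = recover(flash)            # as if RAM contents were lost here
```

Only the records written after the checkpoint are scanned, which is what keeps recovery time proportional to the log tail rather than the whole log.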
![Page 19: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/19.jpg)
Bw-Tree: Support for Transactions
[Figure: Deuteronomy Architecture]
• Transactional Component
  – Serves: App Needing Transactional Key-Value Store
• Data Component
  – Access Method: Bw-Tree Latch Free Ordered Index, Latch-Free Linear Hashing
    -> Serves: App Needing Atomic Key-Value Store
  – LLAMA: Page Storage Engine
    -> Serves: App Needing High Performance Log Structured “Page” Store
![Page 20: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/20.jpg)
End-to-end Crash Recovery
• Data Component (DC) recovery
  – Bw-Tree fast recovery as described
• Transactional Component (TC) recovery
  – Helps to recover unflushed data at DC “up to” end of stable log (WAL) at time of crash
  – Requires DC to recover to a logically consistent state first
[Figure: Transactional Component (TC) sends record operations and control operations to the Data Component (DC), which sits on top of storage]
![Page 21: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/21.jpg)
Bw-Tree in Production
• Key-sequential index in SQL Server Hekaton
  – Lock-free for high concurrency, consistent with Hekaton’s overall non-blocking main memory architecture
• Indexing engine in Azure DocumentDB
  – Rich query processing over a schema-free JSON model, with automatic indexing
  – Sustained document ingestion at high rates
• Sorted key-value store in Bing ObjectStore
  – Support range queries
  – Optimized for flash SSDs
![Page 22: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/22.jpg)
DocumentDB
• Formal query model optimized for queries over schema-less documents at scale
• Support for relational and hierarchical projections
• Consistent indexing in the face of rapid, sustained high-volume writes (optimized for flash SSDs)
• Developer-tunable consistency-availability tradeoffs with SLAs
• Low-latency, (JavaScript) language-integrated, transactional CRUD on storage partitions
• Elastic scale, resource governed, multi-tenant PaaS
• Relational Stores: Fully schematized, relational queries, transactions (e.g., SQL Azure, Amazon RDS, SQL IaaS)
• Key-Value/Column Family Stores: Schema-less with opaque values, lookups on keys (e.g., Azure Tables, HBase, BigTable, LevelDB, Cassandra, …)
  – Example key/value pairs: k1 → XML, k2 → .NET, k3 → Java
• (JSON) Document Stores: Schema-less, rich hierarchical queries, (JavaScript) sprocs/triggers/UDFs (e.g., MongoDB, CouchDB, Espresso, …)
{
"location": [
{ "country": "USA", "city": "NYC" },
{ "country": "Italy", "city": "Rome" }
], "main": "Pisa",
"exports":[
{ "city": "Oslo" },
{ "city": "Lima" }
]
}
/"exports"/?/"city"/"!"-> eval("js", "function(input, output) { output.results= input.results.sort(); }")
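[Figure: The document above drawn as a tree. The root has children location, main, and exports. location has elements 0 (country: USA, city: NYC) and 1 (country: Italy, city: Rome); main is Pisa; exports has elements 0 (city: Oslo) and 1 (city: Lima).]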
![Page 23: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/23.jpg)
Write
• Sustained high volume writes without any term locality
• Extremely high concurrency
• Queries should honor various consistency levels
• Multi-tenancy with strict, reservation based, sub-process level resource governance
Consistent Indexing over schema-less documents is an overly constrained design space
![Page 24: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/24.jpg)
Key Challenges
• Index update: No key locality; cannot afford a Read to do the Write; low Write Amplification
• Queries: Low Read Amplification
• Frugal resource budget
[Figure: Two ways to update the postings for term t. Read-modify-write: read page t {d1, d2, d3}, modify, write t {d1, d2, d3, d4}. Blind update: deltas t {d4+} and t {d2-} are added on top of the base page t {d1, d2, d3} (or a page stub) without reading it; consolidation later yields the page t {d1, d3, d4}.]
Bw-Tree Resource Governance
• CPU resource governance
  – Threads calling into Bw-Tree do not block (upon I/O or in-memory page access)
  – Top-level scheduler controls thread budget per replica
• Memory resource governance
  – Dynamically configurable buffer pool limit
• IOPS resource governance
  – Check resource usage before issuing I/O, retry after dynamically computed timeout interval
• Storage resource governance
  – LLAMA log-structured store can grow/shrink dynamically
  – Self-adjusting based on logical data size
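One way to picture the IOPS check is a token bucket, used here as a stand-in since the slide does not say how the budget or the timeout is actually computed. `IopsGovernor` and `try_issue` are hypothetical names, and the clock is passed in explicitly so the example is deterministic.

```python
class IopsGovernor:
    """Token bucket: one token per I/O, refilled at `iops_limit` per second."""

    def __init__(self, iops_limit, now=0.0):
        self.rate = float(iops_limit)     # tokens added per second
        self.tokens = float(iops_limit)   # start with one second of budget
        self.last = now

    def try_issue(self, now):
        """Return 0.0 if an I/O may be issued now, else the seconds to wait."""
        elapsed = now - self.last
        self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0.0
        # Time until the next whole token becomes available.
        return (1.0 - self.tokens) / self.rate


gov = IopsGovernor(iops_limit=2, now=0.0)
waits = [
    gov.try_issue(0.0),   # budget available
    gov.try_issue(0.0),   # budget available
    gov.try_issue(0.0),   # budget exhausted: caller should retry later
    gov.try_issue(0.5),   # retry after the suggested wait succeeds
]
```

The caller never blocks inside the governor; it gets a wait interval back and can yield its thread, which fits the non-blocking CPU governance described on the same slide.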
![Page 26: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/26.jpg)
Bringing up Bw-Tree Replica
• Obtain Bw-Tree physical state stream from primary
  – LLAMA checkpoint file (most recent)
  – Valid portion of LLAMA log (between GC and write points)
• Bring up Bw-Tree using fast recovery
• Catch up with primary
  – Replay logical operations from primary with LSNs upward of last (contiguous) LSN in recovered Bw-Tree
[Figure: A primary serving reads/writes and two secondaries serving reads, with document replication plus indexing. An existing replica (Bw-Tree index over a log structured store) ships its state to a new replica with the same structure.]
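The catch-up step can be sketched as below, assuming LSNs start at 1, each logical operation is an idempotent upsert or delete of the form `(lsn, key, value)`, and a `value` of `None` means delete. These conventions and the function names are assumptions for the example only.

```python
def last_contiguous_lsn(applied_lsns):
    """Largest LSN n such that every LSN in 1..n has been applied."""
    n = 0
    while n + 1 in applied_lsns:
        n += 1
    return n


def catch_up(state, applied_lsns, primary_ops):
    """Replay operations above the last contiguous LSN; return that LSN.

    Operations already applied beyond the contiguous point are replayed
    again, which is safe only because they are idempotent.
    """
    start = last_contiguous_lsn(applied_lsns)
    for lsn, key, value in sorted(primary_ops, key=lambda op: op[0]):
        if lsn <= start:
            continue
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
        applied_lsns.add(lsn)
    return start


# State recovered from the checkpoint and log: LSN 3 is missing, 4 is present.
state = {"a": 1, "b": 2, "d": 4}
applied = {1, 2, 4}
ops = [(1, "a", 1), (2, "b", 2), (3, "c", 3), (4, "d", 4), (5, "a", None)]
replayed_from = catch_up(state, applied, ops)
```

Starting from the last contiguous LSN rather than the highest one seen closes any gaps left by operations that had not reached the shipped log.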
![Page 27: The Bw-Tree Key-Value Store and Its Applications to …...The Bw-Tree Key-Value Store and Its Applications to Server/Cloud Data Management in Production Sudipta Sengupta Joint work](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea6fa7267bd6b2b514b0ccf/html5/thumbnails/27.jpg)
Bw-Tree: Summary
• Classical B-Tree redesigned from the ground up for modern hardware and cloud
  – Lock-free for high concurrency on multi-core processors
  – Delta updating of pages in memory for cache efficiency
  – Log-structured storage organization for flash SSDs
  – Flexible resource governance in multi-tenant setting
  – Transactional component can be layered above as part of Deuteronomy architecture
• Shipping in Microsoft’s server/cloud offerings
  – Key-sequential index in SQL Server Hekaton
  – Indexing engine in Azure DocumentDB
  – Sorted key-value store in Bing ObjectStore
• Going forward
  – Layer a transactional component on top as per Deuteronomy architecture (CIDR 2015, VLDB 2016)
  – Open-source the codebase