Concurrent Cache-Oblivious B-trees Using Transactional Memory

Jim Sukha, Bradley Kuszmaul
MIT CSAIL
June 10, 2006

Page 1: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Concurrent Cache-Oblivious B-trees Using Transactional Memory

Jim Sukha

Bradley Kuszmaul

MIT CSAIL

June 10, 2006

Page 2: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Thought Experiment

Imagine that, one day, you are assigned the following task:

Enclosed is code for a serial, cache-oblivious B-tree. We want a reasonably efficient parallel implementation that works for disk-resident data.

Attach: COB-tree.tar.gz

PS. We want to be able to restore the data to a consistent state after a crash too.

PPS. Our deadline is next week. Good luck!

Page 3: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Concurrent COB-tree?

Question:

How can one program a concurrent, cache-oblivious B-tree?

Approach:

We employ transactional memory. What complications does I/O introduce?

Page 4: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Potential Pitfalls Involving I/O

Suppose our data structure resides on disk.

1. We might need to make explicit I/O calls to transfer blocks between memory and disk. But a cache-oblivious algorithm doesn’t know the block size B!

2. We might need buffer management code if the data doesn’t fit into main memory.

3. We might need to unroll I/O if we abort a transaction that has already written to disk.

Page 5: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Our Solution: Libxac

• We have implemented Libxac, a page-based transactional memory system that operates on disk-resident data. Libxac supports ACID transactions on a memory-mapped file.

• Using Libxac, we are able to implement a complex data structure that operates on disk-resident data, e.g. a cache-oblivious B-tree.

Page 6: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Libxac Handles Transaction I/O

1. We might need to make explicit I/O calls to transfer blocks between memory and disk.

Similar to mmap, Libxac provides a function xMmap. Thus, we can operate on disk-resident data without knowing block size.

2. We might need buffer management code if the data doesn’t fit into main memory.

Like mmap, the OS automatically buffers pages in memory.

3. We might need to unroll I/O if we abort a transaction that has already written to disk.

Since Libxac implements multiversion concurrency control, we still have the original version of a page even if a transaction aborts.
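The first point can be illustrated with plain POSIX mmap, which xMmap is said to resemble. The sketch below is not Libxac code: the helper name increment_first_int is hypothetical, and it assumes a POSIX system. It updates the first integer of a file through a shared mapping, and no block size appears anywhere in it.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: increment the first int stored in the file at `path`
 * through a shared memory mapping. Returns 0 on success and stores the new
 * value in *out; returns -1 on any error. */
int increment_first_int(const char *path, int *out) {
    assert(path != NULL && out != NULL);

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;

    /* Make sure the file holds at least one int (new bytes read as zero). */
    struct stat st;
    if (fstat(fd, &st) != 0 ||
        (st.st_size < (off_t)sizeof(int) && ftruncate(fd, sizeof(int)) != 0)) {
        close(fd);
        return -1;
    }

    int *x = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (x == MAP_FAILED) return -1;

    x[0]++;                         /* ordinary memory access; the OS pages data in and out */
    *out = x[0];

    msync(x, sizeof(int), MS_SYNC); /* ask the OS to write the change back to the file */
    munmap(x, sizeof(int));
    return 0;
}
```

Calling the helper twice on a fresh file should yield 1 and then 2. Unlike xMmap, this plain mapping provides no atomicity or isolation between concurrent processes.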

Page 7: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Outline

• Programming with Libxac

• Cache-Oblivious B-trees

Page 8: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Example Program with Libxac

    int main(void) {
        int* x;
        int status = FAILURE;
        xInit("/logs", DURABLE);
        x = xMmap("input.db", 4096);

        while (status != SUCCESS) {
            xbegin();
            x[0]++;
            status = xend();
        }
        xMunmap(x);
        xShutdown();
        return 0;
    }

xInit: Runtime initialization function. For durable transactions, logs are stored in the specified directory.*

xMmap: Transactionally maps the first page of the input file.

xbegin() to xend(): Transaction body. The body can be a complex function (e.g., a cache-oblivious B-tree insert!).

xMunmap, xShutdown: Unmap the region. Shut down the runtime.

* Currently Libxac logs the transaction commits, but we haven’t implemented the recovery program yet.

Page 9: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Libxac Memory Model

    int main(void) {
        int* x;
        int status = FAILURE;
        xInit("/logs", DURABLE);
        x = xMmap("input.db", 4096);

        while (status != SUCCESS) {
            xbegin();
            x[0]++;
            status = xend();
        }
        xMunmap(x);
        xShutdown();
        return 0;
    }

1. Aborted transactions are visible to the programmer (thus, the programmer must explicitly retry the transaction). Control flow always proceeds from xbegin() to xend(). Thus, the transaction body can contain system/library calls.

2. At xend(), all changes to the xMmap’ed region are discarded on FAILURE, or committed on SUCCESS.

3. Aborted transactions always see consistent state. Read-only transactions can always succeed.

* Libxac supports concurrent transactions on multiple processes, not threads.

Page 10: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Implementation Sketch

• Libxac detects memory accesses by using a SIGSEGV handler to catch a memory protection violation on a page that has been mmap’ed.

• This mechanism is slow for normal transactions:
  – Time for mmap, SIGSEGV handler: ~10 µs

• It is efficient if we must perform disk I/O to log transaction commits:
  – Time to access disk: ~10 ms
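The page-protection technique named above can be sketched as follows. This is not Libxac's handler: the names track_pages, faults_seen and on_segv are hypothetical, the sketch assumes Linux, and it only counts first touches instead of recording read and write sets or keeping page versions. Calling mprotect inside a signal handler is not on the POSIX async-signal-safe list, although it is a common practice for this kind of page tracking.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static char *g_region;                  /* start of the tracked mapping */
static size_t g_page_size;              /* system page size */
static size_t g_num_pages;              /* number of tracked pages */
static volatile sig_atomic_t g_faults;  /* first-touch faults observed */

/* On a fault inside the tracked region, count it and open the page up.
 * The faulting instruction is then retried and succeeds. */
static void on_segv(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)g_region;
    if (addr < base || addr >= base + g_num_pages * g_page_size)
        _exit(1);                       /* a genuine crash, not one of our pages */
    g_faults++;
    char *page = g_region + ((addr - base) / g_page_size) * g_page_size;
    mprotect(page, g_page_size, PROT_READ | PROT_WRITE);
}

/* Map `num_pages` anonymous pages with no access rights and install the
 * handler. Returns the start of the region, or NULL on failure. */
char *track_pages(size_t num_pages) {
    assert(num_pages > 0);
    g_page_size = (size_t)sysconf(_SC_PAGESIZE);
    g_num_pages = num_pages;
    g_faults = 0;
    g_region = mmap(NULL, num_pages * g_page_size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (g_region == MAP_FAILED) return NULL;

    struct sigaction sa;
    sa.sa_sigaction = on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    if (sigaction(SIGSEGV, &sa, NULL) != 0) return NULL;
    return g_region;
}

int faults_seen(void) { return (int)g_faults; }
```

After track_pages(2), the first access to each page triggers exactly one fault and later accesses to the same page run at full speed, which is why the per-page cost matters only relative to the cost of the I/O it accompanies.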

Page 11: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Is xMmap practical?

Experiment on a 4-processor AMD Opteron, performing 100,000 insertions of elements with random keys into a B-tree. Each insert is a separate transaction. Libxac and BDB both implement group commit.

B-tree and COB-tree both use Libxac. Note that none of the three data structures has been properly tuned.

Conclusion: We should achieve good performance.

Page 12: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Outline

• Programming with Libxac

• Cache-Oblivious B-trees

Page 13: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

What is a Cache-Oblivious B-tree?

• A cache-oblivious B-tree (e.g. [BDFC00]) is a dynamic dictionary data structure that supports searches, insertions/deletions, and range-queries.

• A cache-oblivious algorithm/data structure does not know system parameters (e.g., the block size B).

• Theorem [FLPR99]: a cache-oblivious algorithm that is optimal for a two-level memory hierarchy is also optimal for a multi-level hierarchy.

Page 14: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-Tree Example

The COB-tree can be divided into two pieces:

1. A packed memory array that stores the data in order, but contains gaps.

2. A static cache-oblivious binary tree that indexes the packed memory array.

Static Cache-Oblivious Tree (keys by level, root first): 21 / 10, 45 / 4, 16, 38, 54 / 4, 10, 16, 21, 38, 45, 54, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 38 | 39 40 -- 45 | -- 48 54 -- | 56 59 70 83

Page 15: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-Tree Insert

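To insert a key of 37:

Static Cache-Oblivious Tree (keys by level, root first): 21 / 10, 45 / 4, 16, 38, 54 / 4, 10, 16, 21, 38, 45, 54, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 38 | 39 40 -- 45 | -- 48 54 -- | 56 59 70 83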

Page 16: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-Tree Insert

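To insert a key of 37:

1. Find the correct section of the PMA using the static tree.

Static Cache-Oblivious Tree (keys by level, root first): 21 / 10, 45 / 4, 16, 38, 54 / 4, 10, 16, 21, 38, 45, 54, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 38 | 39 40 -- 45 | -- 48 54 -- | 56 59 70 83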
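Step 1 can be approximated with an ordinary binary search over the largest key of each PMA section (4, 10, 16, 21, 38, 45, 54, 83 in the example). The sketch below is a stand-in under that assumption, not the static cache-oblivious tree itself; find_section is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first section whose maximum key is >= key.
 * If key exceeds every maximum, the last section is returned.
 * `section_max` must be sorted in increasing order and n must be > 0. */
size_t find_section(const int *section_max, size_t n, int key) {
    assert(section_max != NULL && n > 0);
    size_t lo = 0, hi = n - 1;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (section_max[mid] >= key)
            hi = mid;        /* answer is mid or to its left */
        else
            lo = mid + 1;    /* answer is strictly to the right */
    }
    return lo;
}
```

For the key 37 the search lands on index 4, the section that currently ends in 38. The real structure reaches the same section by walking the static tree, whose memory layout is what makes the search cache-oblivious.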

Page 17: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-Tree Insert

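To insert a key of 37:

1. Find the correct section of the PMA using the static tree.

2. Insert into the PMA. This step may cause a rebalance of the PMA.

Static Cache-Oblivious Tree (keys by level, root first): 21 / 10, 45 / 4, 16, 38, 54 / 4, 10, 16, 21, 38, 45, 54, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 38 | 39 40 -- 45 | -- 48 54 -- | 56 59 70 83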

Page 18: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-Tree Insert

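To insert a key of 37:

1. Find the correct section of the PMA using the static tree.

2. Insert into the PMA. This step possibly requires a rebalance.

3. Fix the static tree.

Static Cache-Oblivious Tree, not yet updated (keys by level, root first): 21 / 10, 45 / 4, 16, 38, 54 / 4, 10, 16, 21, 38, 45, 54, 83

Packed Memory Array (PMA) after the insert and rebalance, where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 37 | 38 39 40 -- | 45 48 54 56 | 59 70 83 --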

Page 19: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-Tree Insert

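To insert a key of 37:

1. Find the correct section of the PMA using the static tree.

2. Insert into the PMA. This step possibly requires a rebalance.

3. Fix the static tree.

Static Cache-Oblivious Tree after the fix (keys by level, root first): 21 / 10, 40 / 4, 16, 37, 56 / 4, 10, 16, 21, 37, 40, 56, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 37 | 38 39 40 -- | 45 48 54 56 | 59 70 83 --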

Page 20: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-Tree Insert

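Insert is a complex operation. If we wanted to use locks, what is the locking protocol? What is the right (cache-oblivious?) lock granularity?

Static Cache-Oblivious Tree (keys by level, root first): 21 / 10, 40 / 4, 16, 37, 56 / 4, 10, 16, 21, 37, 40, 56, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 37 | 38 39 40 -- | 45 48 54 56 | 59 70 83 --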

Page 21: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Conclusions

A page-based TM system such as Libxac

• Represents a good match for disk-resident data structures.
  – The per-page overheads of TM are small compared to the cost of I/O.

• Is easy to program with.
  – Libxac allows us to program a concurrent, disk-resident data structure with ACID properties, as though it were stored in memory.

Page 22: Concurrent Cache-Oblivious  B-trees Using Transactional Memory
Page 23: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Semantics of Local Variables

    int main(void) {
        int y = 0, z = 0, a = 0, b = 0;
        int* x;
        int status = FAILURE;
        xInit("/logs", DURABLE);
        x = xMmap("input.db", 4096);

        while (status != SUCCESS) {
            a++;
            xbegin();
            b = x[0];
            y++;
            x[0]++;
            z = x[0] - 1;
            status = xend();
        }
        xMunmap(x);
        xShutdown();
        return 0;
    }

In this system, Libxac guarantees that after the loop completes:

a == y. The value of a is the number of times the transaction is attempted, and y counts the same attempts because local variables are not rolled back when a transaction aborts.

b == z. This always holds because aborted transactions always see consistent state, even if other programs concurrently access the first page of input.db.
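The two guarantees can be reproduced with a small single-process mock. Everything here is an assumption made for illustration: mock_xbegin and mock_xend are hypothetical stand-ins for xbegin and xend, the "page" is an in-memory array with a private working copy, and aborts are forced a fixed number of times instead of arising from real conflicts.

```c
#include <assert.h>
#include <string.h>

enum { FAILURE = 0, SUCCESS = 1 };

static int committed_page[4];   /* stands in for the xMmap'ed page on disk */
static int working_page[4];     /* the transaction's private copy */
static int forced_aborts;       /* how many mock_xend() calls should fail */

/* Start a transaction: work on a fresh copy of the committed state. */
static void mock_xbegin(void) {
    memcpy(working_page, committed_page, sizeof working_page);
}

/* End a transaction: either discard the working copy or publish it. */
static int mock_xend(void) {
    if (forced_aborts > 0) {
        forced_aborts--;
        return FAILURE;         /* committed_page is left untouched */
    }
    memcpy(committed_page, working_page, sizeof committed_page);
    return SUCCESS;
}

int committed_value(void) { return committed_page[0]; }

/* Run the slide's retry loop with `aborts` forced failures.
 * Returns a (the number of attempts) and reports y, b, z. */
int run_increment(int aborts, int *y_out, int *b_out, int *z_out) {
    assert(aborts >= 0);
    int a = 0, y = 0, b = 0, z = 0, status = FAILURE;
    int *x = working_page;
    forced_aborts = aborts;
    while (status != SUCCESS) {
        a++;
        mock_xbegin();
        b = x[0];
        y++;                    /* a local: survives an abort */
        x[0]++;                 /* page data: discarded on abort */
        z = x[0] - 1;
        status = mock_xend();
    }
    *y_out = y; *b_out = b; *z_out = z;
    return a;
}
```

With two forced aborts the loop runs three times, so a and y both end at 3, while the page value is incremented exactly once.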

Page 24: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

TM System Improvements?

Possible improvements to Libxac:

– Provide more efficient support for non-durable transactions by modifying the OS to track and report pages accessed?

– Integrate Libxac with another TM system to provide concurrency control on both multiple threads and multiple processes?

Page 25: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Implementation Sketch

Code: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();

[Figure: memory map of four pages, starting at x[0], x[1024], x[2048], and x[3072], backed by input.txt (labelled 1, 7, 2, 9), alongside a buffer file and a log file. All four pages are PROT_NONE.]

Page 26: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Implementation Sketch

Code: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();

[Figure: same memory map, buffer file, and log file. The read of x[0] raises a segmentation fault, and the page holding x[0] becomes PROT_READ.]

Page 27: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Implementation Sketch

Code: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();

[Figure: same memory map, buffer file, and log file. The access to x[1024] raises a segmentation fault; the pages holding x[0] and x[1024] are marked PROT_READ.]

Page 28: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Implementation Sketch

Code: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();

[Figure: same memory map, buffer file, and log file. The page holding x[1024] is still PROT_READ, so the write raises a second segmentation fault.]

Page 29: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Implementation Sketch

Code: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();

[Figure: same memory map, buffer file, and log file. On the second segmentation fault, the contents of the page holding x[1024] (labelled 7) are copied.]

Page 30: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Implementation Sketch

Code: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();

[Figure: same memory map, buffer file, and log file. After the second segmentation fault, the copied page (labelled 7) is marked PROT_READ|PROT_WRITE; the page holding x[0] stays PROT_READ.]

Page 31: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Implementation Sketch

Code: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();

[Figure: same memory map, buffer file, and log file. The write completes on the PROT_READ|PROT_WRITE page, which is now labelled 2.]

Page 32: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Implementation Sketch

Code: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();

[Figure: same memory map, buffer file, and log file. The modified page (labelled 2) is written to the log on disk.]

Page 33: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Implementation Sketch

Code: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();

[Figure: same memory map, buffer file, and log file. At xend(), the modified page's contents (labelled 2) are copied, and all four pages return to PROT_NONE.]

Page 34: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Focus on PMA Rebalance

    insert(tree, key, value) {
        xbegin();
        x = find_location_in_pma(tree->static_index, key);
        insert_into_pma(tree->pma, key, value, x);   /* may call rebalance(tree->pma); */
        fix_static_index(tree->static_index);
        xend();
    }

In this talk, we illustrate the problems of transaction I/O by considering a transactional rebalance of the packed memory array.

Page 35: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

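Rebalance of an In-Memory Array

    #include <math.h>   // for floor

    void rebalance(int* x, int n) {
        int i;
        int count = 0;

        // Slide everything left
        for (i = 0; i < n; i++) {
            if (x[i] != EMPTY_SLOT) {
                x[count] = x[i];
                count++;
            }
        }

        // Redistribute items from right
        int j = count-1;
        double spacing = 1.0*n/count;
        for (i = n-1; i >= 0; i--) {
            if (j >= 0 && floor(j*spacing) == i) {
                x[i] = x[j];
                j--;    // move on to the next item to place
            }
            else {
                x[i] = EMPTY_SLOT;
            }
        }
    }

Example ("--" marks an empty slot):

Before:                 0 -- 1 2 | 3 -- -- 4 | -- 5 6 --
After sliding left:     0 1 2 3 | 4 5 6 4 | -- 5 6 --
After redistributing:   0 1 -- 2 | -- 3 4 -- | 5 -- 6 --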

Page 36: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Rebalance with Explicit I/O

    void rebalance_with_io(int n) {
        int i;
        int count = 0;
        int* y;
        int* z;

        y = read_block(0);
        z = read_block(0);

        for (i = 0; i < n; i++) {
            if (i % B == 0) {
                y = read_block(i/B);
            }
            if (y[i%B] != EMPTY_SLOT) {
                if (count % B == 0) {
                    z = read_block(count/B);
                }
                z[count%B] = y[i%B];
                count++;
            }
        }
        ...
    }

write_block(…)

Why do we want to avoid performing explicit I/O to read/write data blocks?

Issues:

1. What if the data does not fit into memory? We must have buffer management code somewhere.

2. A cache-oblivious algorithm does not know the value of B!

Page 37: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Rebalance using Memory Mapping

    void rebalance_with_mmap(int n) {
        int i;
        int count = 0;
        x = mmap("input.db", n*sizeof(int));
        for (i = 0; i < n; i++) {
            if (x[i] != EMPTY_SLOT) {
                x[count] = x[i];
                count++;
            }
        }
        ...
        munmap(x, n*sizeof(int));
    }

1. What if the data does not fit into memory?

If we use memory mapping, then the OS automatically buffers pages that are accessed.

2. What value of B do we choose for a cache-oblivious algorithm?

I/O is transparent to the user. B does not appear in the application code.

Using mmap, the code looks like the in-memory rebalance. But we still need concurrency control.

Page 38: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Concurrent Rebalance?

    void rebalance_with_mmap(int n) {
        int i;
        int count = 0;
        x = mmap("input.db", n*sizeof(int));
        for (i = 0; i < n; i++) {
            if (x[i] != EMPTY_SLOT) {
                x[count] = x[i];
                count++;
            }
        }
        ...
        munmap(x, n*sizeof(int));
    }

What happens if we want the rebalance to occur concurrently?

If we use locks, what do we choose as the locking granularity?

If we use transactions, will the system need to unroll I/O when a transaction aborts to ensure the data on disk is consistent?

Page 39: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Transactional Memory Mapping

    void rebalance_with_xMmap(int n) {
        int i;
        int count = 0;
        x = xMmap("input.db", n*sizeof(int));
        xbegin();
        for (i = 0; i < n; i++) {
            if (x[i] != EMPTY_SLOT) {
                x[count] = x[i];
                count++;
            }
        }
        ...
        xend();
        xMunmap(x, n*sizeof(int));
    }

Our solution:

Replace mmap with xMmap, and use transactions for concurrency control.

The transaction system maintains multiple versions of a page to avoid needing to unroll I/O.

Transactional memory mapping simplifies the code for a concurrent disk-resident data structure.

Page 40: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-Tree Insert

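2(a) Add 37 to packed memory array.

Static Cache-Oblivious Tree (keys by level, root first): 21 / 10, 45 / 4, 16, 38, 54 / 4, 10, 16, 21, 38, 45, 54, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 37 | 38 39 40 45 | -- 48 54 -- | 56 59 70 83

Packed Memory Array Density Thresholds: 2.5 ≤ n ≤ 7.5; 6 ≤ n ≤ 14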

Page 41: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-Tree Insert

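2(a) Add 37 to packed memory array.

2(b) Rebalance the PMA.

Static Cache-Oblivious Tree (keys by level, root first): 21 / 10, 45 / 4, 16, 38, 54 / 4, 10, 16, 21, 38, 45, 54, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 37 | 38 39 40 -- | 45 48 54 56 | 59 70 83 --

Packed Memory Array Density Thresholds: 2.5 ≤ n ≤ 7.5; 6 ≤ n ≤ 14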

Page 42: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Dictionary Operations using a B+-Tree

[Figure: example B+-tree with nodes of size B; the individual key labels are not recoverable from the transcript.]

The branching factor of the tree and the size of a block on disk are both Θ(B). For a B+-tree, the data is stored at the leaves. The keys at an interior node represent the maximum key of that node’s subtree.

But to build the tree, we need to know the value of B…

Page 43: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-tree [BDFC00]

Operation                  Cost in Block Transfers
Search(key)                O(log_B N)
Insert(key, value)         O(log_B N)**
Delete(key, value)         O(log_B N)**
RangeQuery(start, end)     O(log_B N + k/B)*

* Bound assumes range query finds k items with keys between start and end.
** Amortized bound.

It is possible to support dictionary operations cache-obliviously, i.e., with a data structure that does not know the value of B. The cache-oblivious B-tree (COB-tree) achieves the same asymptotic (amortized) bounds as a B+-tree.
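To get a feel for the search bound, consider illustrative values that are assumptions and not figures from the talk: N = 2^30 keys and B = 2^10 keys per block. The change-of-base identity gives the number of block transfers up to constant factors, compared with a binary search tree that ignores blocking:

```latex
\log_B N \;=\; \frac{\log_2 N}{\log_2 B} \;=\; \frac{30}{10} \;=\; 3,
\qquad \text{versus} \qquad \log_2 N \;=\; 30 .
```

The cache-oblivious structure obtains the left-hand behaviour without ever being told the value of B.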

Page 44: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-Tree Overview

The static tree is used as an index into a packed memory array.

To perform an insert, insert into the packed memory array, and update the static tree. When the packed memory array becomes too full (empty), rebuild and grow (shrink) the entire data structure.

Static Cache-Oblivious Binary Tree

Page 45: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Static Cache-Oblivious Binary Tree

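Divide the tree into Θ(N^(1/2)) subtrees of size Θ(N^(1/2)). Lay out each subtree contiguously in memory, recursively.

[Figure: the tree is split into a top subtree and bottom subtrees of size Θ(N^(1/2)), numbered 1, 2, ..., N^(1/2) in layout order; each is split again into subtrees of size Θ(N^(1/4)), numbered 1, 2, 3, ..., N^(1/4).]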
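The recursive layout described above can be sketched for a complete binary tree whose nodes carry breadth-first indices (root 1, children of i at 2i and 2i+1). The function name veb_layout and the choice to give the top subtree the rounded-down half of the height are assumptions of this sketch; other roundings give equally valid layouts.

```c
#include <assert.h>
#include <stddef.h>

/* Append to out[] the breadth-first indices of the nodes in the subtree of
 * the given height rooted at `root`, in recursive (van Emde Boas style)
 * layout order: first the top subtree, then each bottom subtree in turn. */
void veb_layout(unsigned root, int height, unsigned *out, size_t *pos) {
    assert(height >= 1 && out != NULL && pos != NULL);
    if (height == 1) {
        out[(*pos)++] = root;
        return;
    }
    int top = height / 2;          /* levels in the top subtree */
    int bottom = height - top;     /* levels in each bottom subtree */

    veb_layout(root, top, out, pos);

    /* The bottom subtrees are rooted `top` levels below `root`. */
    unsigned first = root << top;
    for (unsigned i = 0; i < (1u << top); i++)
        veb_layout(first + i, bottom, out, pos);
}
```

For a tree of height 4 (15 nodes) the resulting order is 1 2 3, then 4 8 9, 5 10 11, 6 12 13, 7 14 15: each three-node subtree is contiguous in memory, so a root-to-leaf walk touches few blocks whatever the block size is.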

Page 46: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Packed Memory Array

A packed memory array uses a contiguous section of memory to store elements in order, but with gaps.

For sections of size 2^k, gaps are spaced to maintain specified density thresholds that become arithmetically stricter as k increases.

PMA ("--" marks an empty slot):
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 38 | 39 40 -- 45 | -- 48 54 -- | 56 59 70 83

Density thresholds labelled on the figure: [4/16, 16/16], [5/16, 15/16], [6/16, 14/16], [7/16, 13/16]

Density Thresholds Example:

2^4: Density between 4/16 and 16/16.
2^5: Density between 5/16 and 15/16.
2^6: Density between 6/16 and 14/16.
2^7: Density between 7/16 and 13/16.
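The arithmetic progression in the figure labels ([4/16, 16/16], [5/16, 15/16], [6/16, 14/16], [7/16, 13/16]) can be written as a small function. This is a sketch under the assumption that the four labels correspond to section sizes 2^4 through 2^7, as in the example; the names density_bounds, bounds_for_level and within_bounds are hypothetical.

```c
#include <assert.h>

/* Density bounds expressed as numerators over 16. */
typedef struct {
    int lo16;   /* minimum density is lo16/16 */
    int hi16;   /* maximum density is hi16/16 */
} density_bounds;

/* Bounds for a section of 2^k slots, k in 4..7: the lower bound rises and
 * the upper bound falls by 1/16 for each increase in k. */
density_bounds bounds_for_level(int k) {
    assert(k >= 4 && k <= 7);
    density_bounds b = { 4 + (k - 4), 16 - (k - 4) };
    return b;
}

/* Returns 1 if a section of 2^k slots holding `count` elements is within
 * its density bounds, 0 otherwise. Integer arithmetic avoids rounding. */
int within_bounds(int k, int count) {
    density_bounds b = bounds_for_level(k);
    int slots = 1 << k;
    return 16 * count >= b.lo16 * slots && 16 * count <= b.hi16 * slots;
}
```

A section that falls outside its bounds after an insert or delete is the trigger for rebalancing a larger enclosing section, whose looser per-slot bounds leave room to spread the elements out.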

Page 47: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Example Cache-Oblivious B-Tree

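Static Cache-Oblivious Tree (keys by level, root first): 21 / 10, 45 / 4, 16, 38, 54 / 4, 10, 16, 21, 38, 45, 54, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 38 | 39 40 -- 45 | -- 48 54 -- | 56 59 70 83

Packed Memory Array Density Thresholds: [4/16, 16/16], [5/16, 15/16], [6/16, 14/16], [7/16, 13/16]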

Page 48: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-Tree Insert

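Insert 37:

Static Cache-Oblivious Tree (keys by level, root first): 21 / 10, 45 / 4, 16, 38, 54 / 4, 10, 16, 21, 38, 45, 54, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 38 | 39 40 -- 45 | -- 48 54 -- | 56 59 70 83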

Page 49: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-Tree Insert

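Insert 37:

1. Find the correct section of the PMA using the static tree.

Static Cache-Oblivious Tree (keys by level, root first): 21 / 10, 45 / 4, 16, 38, 54 / 4, 10, 16, 21, 38, 45, 54, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 38 | 39 40 -- 45 | -- 48 54 -- | 56 59 70 83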

Page 50: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-Tree Insert

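Insert 37:

1. Find the correct section of the PMA using the static tree.

2. Insert into the PMA. This step possibly requires a rebalance.

Static Cache-Oblivious Tree (keys by level, root first): 21 / 10, 45 / 4, 16, 38, 54 / 4, 10, 16, 21, 38, 45, 54, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 38 | 39 40 -- 45 | -- 48 54 -- | 56 59 70 83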

Page 51: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Example Cache-Oblivious B-Tree

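2(a) Add 37 to packed memory array.

Static Cache-Oblivious Tree (keys by level, root first): 21 / 10, 45 / 4, 16, 38, 54 / 4, 10, 16, 21, 38, 45, 54, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 37 | 38 39 40 45 | -- 48 54 -- | 56 59 70 83

Density thresholds shown: [5/16, 15/16], [6/16, 14/16]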

Page 52: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Example Cache-Oblivious B-Tree

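2(a) Add 37 to packed memory array.

2(b) Rebalance the PMA.

Static Cache-Oblivious Tree (keys by level, root first): 21 / 10, 45 / 4, 16, 38, 54 / 4, 10, 16, 21, 38, 45, 54, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 37 | 38 39 40 -- | 45 48 54 56 | 59 70 83 --

Density thresholds shown: [5/16, 15/16], [6/16, 14/16]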

Page 53: Concurrent Cache-Oblivious  B-trees Using Transactional Memory

Cache-Oblivious B-Tree Insert

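Insert 37:

1. Find the correct section of the PMA using the static tree.

2. Insert into the PMA. This step possibly requires a rebalance.

3. Fix the static tree.

Static Cache-Oblivious Tree after the fix (keys by level, root first): 21 / 10, 40 / 4, 16, 37, 56 / 4, 10, 16, 21, 37, 40, 56, 83

Packed Memory Array (PMA), where "--" marks an empty slot:
1 -- -- 4 | 6 7 10 -- | 13 15 -- 16 | -- -- 21 -- | 23 24 31 37 | 38 39 40 -- | 45 48 54 56 | 59 70 83 --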