the full path to martin farach-colton, william jannen, rob ...€¦ · the full path to full-path...

Post on 17-Aug-2020

5 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

The Full Path to Full-Path Indexing

Yang Zhan, Alex Conway, Yizheng Jiao, Eric Knorr, Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Donald E. Porter, Jun Yuan

Talk Overview▪ What is full-path indexing and its benefits?

• locality▪ What are the challenges?

• renames▪ How do we overcome them?

• data structure techniques: tree surgery and lifting

Conventional file systems use inodes/

foo

bar

var

A

B

C

D

E

data of D...

...

...data of C

data of A

...data of E

...data of B

disk...

worst case: no relation between the logical and physical locations

Inode file systems show no locality in the worst case/

foo

bar

var

A

B

C

D

E

data of D...

...

...data of C

data of A

...data of E

...data of B

disk...

read all files under /foo

data of D

data of C

data of E

multiple random IOs

Full-path indexing file systems use full-paths▪ Full-path indexing file systems index metadata and

data in key-value stores using full-paths

Full-path indexing file systems use full-paths/

foo

bar

var

A

B

C

D

E

disk

logically related things are physically related

.../A: data/bar/B: data/foo/C: data/foo/var/D: data

.../foo/var/E: data

Full-path file systems ensure locality/

foo

bar

var

A

B

C

D

E

disk

read all files under /foo

one large IO

.../A: data/bar/B: data/foo/C: data/foo/var/D: data

.../foo/var/E: data

/foo/C: data/foo/var/D: data/foo/var/E: data

Scans are fast in full-path file systems

40

ext4 btrfs BetrFS-FPxfs zfs

seco

nds

Time to grep the linux source directory (lower is better)

full-path file systems are 3.3x faster than other file systems

60

20

0

Talk Overview▪ What is full-path indexing and its benefits?

• locality▪ What are the challenges?

• renames▪ How do we overcome them?

• data structure techniques: tree surgery and lifting

Renames are cheap in inode file systems/

foo

bar

var

A

B

C

D

E

data of D...

...

...data of C

data of A

...data of E

...data of B

disk...

rename /foo/var to /bar/var

one pointer swing

Renames are cheap in inode file systems/

foo

bar

var

A

B

C

D

E

data of D...

...

...data of C

data of A

...data of E

...data of B

disk...

rename /foo/var to /bar/var

no change

Renames seem expensive with full-path indexing/

foo

bar

var

A

B

C

D

E

.../A: data/bar/B: data/foo/C: data/foo/var/D: data

.../foo/var/E: data

disk

rename /foo/var to /bar/var

/foo/var/D: data/foo/var/E: data

need to maintain key order

Renames seem expensive with full-path indexing

.../A: data/bar/B: data/foo/C: data

...

/foo/var/D: data/foo/var/E: data

.../A: data/bar/B: data

/foo/C: data...

/bar/var/D: data/bar/var/E: data

1. move kv pairs2. change key prefixes

Expensive when rename size is large

Renaming big files are slow in full-path file systems

Renaming big files are slow in full-path file systems

slow

Renaming big files are slow in full-path file systems

Rename the Linux source directory takes 20 seconds

slow

Inode vs. Full-path indexing

rename locality

inode file systems

full-path file systems

We want to get decent renames with good locality

▪ In FAST 2016, zoning was introduced to BetrFS▪ Zoning tries to get both locality and fast renames

Zoning tries to solve the rename problem

Zoning Example/

foo

bar

var

A

B

C

D

E

full-path indexing within one zone

indirection between zones zone maintenance cost:a zone is too large: zone splita zone is too small: zone merge

Zoning tries to achieve both fast renames and locality

small zones(inode file systems)

big zones(full-path file systems)

scan cost rename cost

sweet spot

Zoning performance

rename locality other operations

zoning

Zoning achieves cheap renames

Zoning achieves cheap renames

big files form their own zones, renaming them is cheap

Zoning performance

rename locality other operations

zoning

Zoning has relatively good locality

0

20

40

60

ext4 btrfs BetrFS-FPxfs zfs

seco

nds

Time to grep the linux source directory (lower is better)

BetrFS-Zone

zoning is still 2.2x faster than other file systems, but 33% slower than full-path indexing

Zoning performance

rename locality other operations

zoning

Zone maintenance can be expensive

10k zone splits

50% drop

Zoning performance

rename locality other operations

zoning

Zoning is not the answer

Talk Overview▪ What is full-path indexing and its benefits?

• locality▪ What are the challenges?

• renames▪ How do we overcome them?

• data structure techniques: tree surgery and lifting

Moving is expensive in an array/

foo

bar

var

A

B

C

D

E

disk

.../A/bar/B/foo/C/foo/var/D

.../foo/var/E

Moving kv pairs looks hard in an array, but they are stored in a Bε-tree in BetrFS

Bε-trees sometimes allow easy movesCan be viewed as B-trees in this talk

/foo/var/D/foo/var/E

/foo/C

/bar/B

/A

leaves store kv pairs

/bar/B/foo/C

/A

interior nodes store pivots and pointers to children

Bε-trees sometimes allow easy moves

/foo/var/D/foo/var/E

/foo/C

/bar/B

/A

/bar/B/foo/C

/A

/foo/C

move /foo/C to /C

This move can be done by changing pivots and a pointer swing

/foo/var/E

Bε-trees sometimes allow easy moves

/foo/var/D/foo/var/E

/foo/C

/bar/B

/A

/bar/B/foo/C

/A

/foo/C

move /foo/C to /C

This move can be done by changing pivots and a pointer swing

/C

Problem 1: keys are not updated (should be /C)

/foo/var/E

Problem 2: not easy to move /foo/var/E

/C

▪ Two problems:• need to get an isolated subtree

▪ tree surgery in O(Bε-tree height) IOs• need to update keys

▪ lifting, no additional IO cost▪ The whole solution is called range-rename

A rename can be done by moving a subtree

Tree surgery slices out an isolated subtree...

...

......

...

......

......

......

......

...

.........

...

.........

...

get a subtree of range (/redmin, /redmax)

completely in range

need to take care of fringe nodes

Tree surgery slices out an isolated subtree...

...

......

...

......

......

......

......

...

.........

...

.........

...

get a subtree of range (/redmin, /redmax)

1. walk down the tree with min and max key

Tree surgery slices out an isolated subtree...

...

......

...

......

......

......

......

...

.........

...

.........

...

get a subtree of range (/redmin, /redmax)

1. walk down the tree with min and max key

2. start node splits from leaves

Tree surgery slices out an isolated subtree...

...

......

...

......

......

......

......

...

......

......

.........

...

get a subtree of range (/redmin, /redmax)

1. walk down the tree with min and max key

2. start node splits from leaves

3. keep splitting until two splits converge

Tree surgery slices out an isolated subtree...

...

......

...

......

......

......

......

...

......

......

.........

...

get a subtree of range (/redmin, /redmax)

1. walk down the tree with min and max key

2. start node splits from leaves

3. keep splitting until two splits converge

Tree surgery slices out an isolated subtree...

...

......

...

......

......

......

......

...

......

......

.........

...

get a subtree of range (/redmin, /redmax)

1. walk down the tree with min and max key

2. start node splits from leaves

3. keep splitting until two splits converge

we now have a isolated subtree

▪ Reasons:• to setup pivots for the source tree• POSIX allows renames to overwrite files

Tree surgery also slices at the destination

Tree surgery finishes with a pointer swing

......

...

▪ rename /red to /blue

...

......

.........

...

...

......

...

Tree surgery finishes with a pointer swing

......

...

▪ rename /red to /blue

...

......

.........

...

...

......

...

Tree surgery completes in O(Bε-tree height) IOs...

...

......

...

......

......

......

......

...

.........

...

.........

...

most data: not touched

each subtree slicing go through two root-to-leaf path

▪ Revert the IO cost to O(subtree size)

▪ Solution: lifting to convert Bε-trees to lifted Bε-trees• prefix updates are free

Updating all keys in the subtree is expensive

Lifting lifts the common prefix of two pivots

/redmax

/orange/redmin

/yellow/red/v/red/e

/red/d/red/c

/red/x/red/w

/red/z/red/y

/red/b/red/a

......

...

......

Common_prefix(/redmin, /redmax) = /red

all keys in the subtree must have prefix /red, there is no need to store that

lift /red

Lifting lifts the common prefix of two pivots

/redmax

/orange/redmin

/yellow/v/e

/d/c

/x/w

/z/y

/b/a

......

...

......

lift /red

Search for /red/z

/red lifted, search for /z

/red lifted, search for /z

/red lifted, search for /z

Lifting lifts the common prefix of two pivots

/redmax

/orange/redmin

/yellow/v/e

/d/c

/x/w

/z/y

/b/a

......

...

......

lift /red

/bluemax

/aqua/bluemin

/orange

lift /blue

Lifting lifts the common prefix of two pivots

/blue/v/blue/e

/blue/d/blue/c

/blue/x/blue/w

/blue/z/blue/y

/blue/b/blue/a

......

...

......

/bluemax

/aqua/bluemin

/orange

lift /blue

all keys in the subtree are updated automatically after the pointer swing

▪ Lifting happens at all times▪ Cost of other operations:

• collect lifted parts along the root-to-leaf path• no additional IO

▪ Cost of maintaining key lifting• key lifting can only change in node splits/merges• no additional IO

Lifting does not introduce additional IOs

▪ Range-rename performs tree surgery• O(Bε-tree height) IOs

▪ Key/value pairs are stored in lifted Bε-trees• keys are updated after tree surgery without cost

Range-rename completes in O(Bε-tree height) IOs

Evaluation

other operations rename applications

range-rename

▪ Dell optilex destop• 4-core 3.4 GHz i7, 4 GB RAM• 7200 RPM 500 GB Seagate Barrcuda

Experimental Setup

Tokubench

no cliff

Range-rename doesn’t charge other operations as much as zoning

Evaluation

other operations rename applications

range-rename

Rename Throughput

Rename Throughput

In normal cases, range-rename can rename as fast as other file systems

Evaluation

other operations rename applications

range-rename

IMAP benchmark

ext4 btrfs BetrFS-Zonexfs zfs BetrFS-RR

The throughput of 4 threads operating on the dovecot server (higher is better)

0

50

100

150

ops/

s

BetrFS-RR is 12% faster than BetrFS-Zone

rsync benchmark

ext4 btrfs BetrFS-Zonexfs zfs BetrFS-RR0

20

40

MB

/s

The throughput of rsync to copy the linux directory (higher is better)

BetrFS-RR is 13% faster than BetrFS-Zone

BetrFS-RR is faster than BetrFS-Zone in application benchmarks

Evaluation

other operations rename applications

range-rename

Evaluation

other operations rename applications

range-rename

● Lower taxes on non-rename operations○ A few regressions

● Worst case rename costs: logarithmic in size● Additional opportunities for range rename

in paper

▪ BetrFS with range-rename• maintain full-path indexing• decent rename performance• no tradeoff: locality, rename and other operations

Conclusion

Web: betrfs.orgCode: https://github.com/oscarlab/betrfs

Email: betrfs@googlegroups.com

top related