B-tree File System (BTRFS) Report
1- Introduction to B-Trees and Shadowing
1.1 B-tree-
In computer science, a B-tree is a tree data structure that keeps data sorted and allows
searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a
generalization of a binary search tree in that a node can have more than two children (Comer
1979, p. 123). Unlike self-balancing binary search trees, the B-tree is optimized for systems
that read and write large blocks of data. It is commonly used in databases and file systems.
The B+-tree can be considered a B-tree variant, with the exception that in a B+-tree only the
leaves contain the data. In binary search trees, each node has a single search-key, with the left
sub-tree and right sub-tree containing all nodes with search-keys less than and greater than the
parent search-key, respectively. In a B+-tree, a node can have multiple search-keys and multiple
child nodes.
In a BST, the distance of a leaf from the tree root is not fixed; it depends on the sequence of
insertions into the BST. But in the case of B or B+ trees, the insertion algorithm ensures that the
distance between leaf and root is the same for all leaves. Figure 1.2 shows the B+-tree. In this
example, the ordering of the words is alphabetical. The size of node 1 is 2, and any further
insertion into a node containing 2 search-keys will cause splitting of the node and a rebalancing
operation. In the case of B+-trees, the leaves are chained together. This is possible because all
search-keys in adjacent leaves are already in sorted order, so chaining enables efficient
sequential access to the data associated with the sorted keys in the leaves at the bottom.
In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some
pre-defined range. When data is inserted or removed from a node, its number of child nodes
changes. In order to maintain the pre-defined range, internal nodes may be joined or split.
Because a range of child nodes is permitted, B-trees do not need re-balancing as frequently as
other self-balancing search trees, but may waste some space, since nodes are not entirely full.
The lower and upper bounds on the number of child nodes are typically fixed for a particular
implementation.
B+-Tree
B-trees ensure logarithmic-time key search, insert and remove operations. B-trees can be
used to represent files or directories in a file system. Files are typically represented by a b-tree
that holds disk extents, i.e. contiguous runs of disk blocks, in its leaves. In the next section, we
will cover the basic concept of shadowing.
1.2 Shadowing-
Shadowing scheme is also known as copy-on-write (COW) scheme. Shadowing technique is
used to ensure atomic update to persistent data-structures in file-system. In this scheme, we look
at the storage in terms of fixed-size pages. There is a page tablewhich has a pointer to all valid
pages. Shadowing means that to update an on-disk page,the entire page is read into memory,
modified, and later written to disk at an alternatelocation. Now all we have to do is to update the
pointer in the page table to point tothis new page in the disk.
Byte-size of pointer is small and it can fit is one sector inthe disk. There are hard drives that
offer atomic sector upgrades and promise you thateither all of the old or new data in the sector.
This means you either have an old page ornewly written page. So atomic persistent updates are
ensured due to this scheme. It is apowerful mechanism to implement the crash recovery,
snapshots.
1.3 Problems with conventional B-Trees-
The entire file-system tree on the disk can be viewed as made of fixed-size pages. When a page
is to be modified, it is read into memory, modified, and later written to some other location on
the disk. Now let us assume that a leaf in the b-tree shown below is equivalent to one on-disk
page. If we try to modify the leaf, then the page corresponding to the leaf will be shadowed.
Now, the immediate ancestor of this node should point to the new copy. This means we will
have to modify the ancestor of this node. Again shadowing is involved, and this process
continues up to the root recursively.
So the entire path up to the root needs to be shadowed. We will call this type of shadowing
strict shadowing. Now an additional problem arises due to the linking of the leaves in the tree.
Since the adjacent leaf should also point to the modified leaf, it too needs to be shadowed. This
process leads to shadowing of the entire tree just because of a modification in one leaf.
Remember, this all happens on hard disks! This leads to performance degradation. The root of
the problem is leaf chaining.
To solve issues related to concurrency, we use mutex locks or semaphores. Now, let's
assume for a while that there are no links between leaves. In a normal b-tree, suppose we need
to modify a single node: we take a lock on it, make changes and then release the lock. But if the
method of updating is shadowing, then we know that changes propagate to the root, making it
necessary to take locks on the way up to the tree root. So there is a race to take the lock on the
tree root. Waiting for a lock is a time-consuming process, and hence there is a need for efficient
synchronization.
Regular b-trees shuffle keys between neighboring nodes for rebalancing purposes after a key
insertion or deletion. If any leaf is modified, then the path up to the root will be shadowed by
default. Suppose that an exchange of keys happens between nodes whose immediate ancestor is
not the same; then an additional path up to the tree root will have to be shadowed due to the
modification caused by the exchange of keys.
Removing a key and effects of re-balancing and shadowing.
So we can say that B-trees + shadowing = an expensive choice, if conventional b-trees are used.
1.4 Modifications in conventional B-tree-
Ohad Rodeh, an IBM researcher, has suggested modifications to the conventional b-tree and
its related algorithms, for integrating b-tree schemes with the shadowing technique. We will
cover a few of them, related to the problems discussed above.
1. To solve the problem of shadowing the whole tree, the links between the leaves are
removed. Due to this, only the path up to the tree root needs to be shadowed.
2. In the case of a rebalancing operation, it is better to exchange keys between nodes whose
immediate ancestor is the same, because this involves shadowing of a single node, which is
better than shadowing another path up to the root involving many nodes.
2-Introduction and History of BTRFS
2.1 Introduction-
Btrfs is a GPL-licensed copy-on-write file system for Linux. Its development began at Oracle
Corporation in 2007. The principal BTRFS author is Chris Mason. Following are some general
points about btrfs:
1. The core data structure of Btrfs is the copy-on-write B-tree, which was originally proposed by
IBM researcher Ohad Rodeh at a presentation at USENIX 2007.
2. Btrfs 1.0 (with finalized on-disk format) was originally slated for a late 2008 release, and was
finally accepted into the mainline kernel as of 2.6.29 in 2009.
3. Btrfs is intended to address the lack of pooling, snapshots and checksums in Linux file systems.
4. The goal of btrfs was "to let Linux scale for the storage that will be available. Scaling is not
just about addressing the storage but also means being able to administer and to manage it with
a clean interface that lets people see what's being used and makes it more reliable."
5. Btrfs shares a number of design ideas with reiser3/4 (Chris Mason was working on ReiserFS
before starting his work on btrfs).
The maximum number of files is 18,446,744,073,709,551,616, or 2 to the 64th power. The
maximum file name length is 255 characters. The theoretical maximum file size is 16 EiB,
though the Linux VFS caps file size at 8 EiB. The BTRFS file system helps reduce
fragmentation; storage devices usually show a loss of performance due to fragmentation
(usually when fuller), and BTRFS allows for online defragmentation.
Should disk space become full, it is possible to add space to an existing BTRFS volume.
This is known as online resizing: the BTRFS file system does not need to be unmounted or
taken offline. Devices can also be added to, or removed from, the volume online.
If a volume has an existing ext3 or ext4 file system, it can be converted to BTRFS. The
conversion is an in-place conversion. This means that the existing data does not have to be
removed before the file system is converted. It is still good practice to perform a backup
beforehand in case the conversion fails.
2.2 History-
The core data structure of Btrfs — the copy-on-write B-tree — was originally proposed by
IBM researcher Ohad Rodeh at a presentation at USENIX 2007. Chris Mason, an engineer
working on ReiserFS for SUSE at the time, joined Oracle later that year and began work on a
new file system based on these B-trees.
Btrfs 1.0 (with finalized on-disk format) was originally slated for a late 2008 release, and was
finally accepted into the mainline kernel as of 2.6.29 in 2009. Several Linux distributions began
offering Btrfs as an experimental choice of root file system during installation, including Arch
Linux, openSUSE 11.3, SLES 11 SP1, Ubuntu 10.10, Sabayon Linux, Red Hat Enterprise Linux
6, Fedora 15, MeeGo, Debian, and Slackware 13.37. In summer 2012, several Linux
distributions have moved Btrfs from experimental to production / supported status, including
SLES 11 SP2 and Oracle Linux 5 and 6, with the Unbreakable Enterprise Kernel Release 2.
In 2011, de-fragmentation features were announced for the Linux 3.0 kernel version. Besides
Mason at Oracle, Miao Xie at Fujitsu contributed performance improvements. In June 2012,
Chris Mason left Oracle for Fusion-io, and in November 2013 he left Fusion-io for Facebook.
He continues to work on Btrfs.
2.3 Why btrfs File-System?
The Linux kernel currently supports almost 140 file systems. Most of these file systems are
generally very good. So why do we need a new file system when we already have so many?
Reasons are:
1. This file system scales to very large storage. This is evident because the maximum size of
storage that the file system can address is 16 EiB (2^64 bytes).
2. This file system is feature focused, providing features the other file systems cannot.
3. Performance is important. This file system does not intend to race with current file systems,
because they are anyway good. It's the features that make btrfs stand out.
4. This file system is administrator focused, so that it is easy to configure, and fault-tolerant.
3- Specifications and Features of BTRFS
3.1 Features-
As of version 3.12 of the Linux kernel mainline, Btrfs implements the following features:
1. Mostly self-healing in some configurations due to the nature of copy on write
2. Online defragmentation
3. Online volume growth and shrinking
4. Online block device addition and removal
5. Online balancing (movement of objects between block devices to balance load)
6. Offline filesystem check
7. Online data scrubbing for finding errors and automatically fixing them for files with
redundant copies
8. RAID 0, RAID 1, RAID 5, RAID 6 and RAID 10
9. Subvolumes (one or more separately mountable filesystem roots within each disk
partition)
10. Transparent compression (zlib and LZO)
11. Snapshots (read-only or copy-on-write clones of subvolumes)
12. File cloning (copy-on-write on individual files, or byte ranges thereof)
13. Checksums on data and metadata (CRC-32C)
14. In-place conversion (with rollback) from ext3/4 to Btrfs
15. File system seeding (Btrfs on read-only storage used as a copy-on-write backing for a
writeable Btrfs)
16. Block discard support (reclaims space on some virtualized setups and improves wear
leveling on SSDs with TRIM)
17. Send/receive (saving diffs between snapshots to a binary stream)
18. Hierarchical per-subvolume quotas
19. Out-of-band data deduplication (requires user space tools)
3.2 Planned features include:
1. In-band data deduplication
2. Online filesystem check
3. Very fast offline filesystem check
4. Object-level RAID 0, RAID 1, and RAID 10
5. Incremental backup
6. Ability to handle swap files and swap partitions
7. Encryption
In 2009, Btrfs was expected to offer a feature set comparable to ZFS, developed by Sun
Microsystems. After Oracle's acquisition of Sun in 2009, Mason and Oracle decided to
continue with Btrfs development.
Cloning-
Btrfs provides a clone operation which atomically creates a copy-on-write snapshot of a file.
Such cloned files are sometimes referred to as reflinks, in light of the associated Linux
kernel system calls.
By cloning, the file system does not create a new link pointing to an existing inode — it
instead creates a new inode that shares the same disk blocks as the original file. As a result, this
operation only works within the boundaries of the same Btrfs file system, while it can cross the
boundaries of subvolumes since Linux kernel version 3.6. The actual data blocks are not
duplicated; at the same time, due to the copy-on-write nature of cloning, modifications to any of
the cloned files are not visible in their parent files and vice-versa.
This should not be confused with hard links, which are directory entries that associate
multiple file names with actual files on a file system. While hard links can be taken as different
names for the same underlying group of disk blocks (known as a file), cloning in Btrfs provides
independent files that are sharing their disk blocks as a form of data deduplication on the disk
block level. Any later change to the content of such files invokes the copy-on-write
mechanism, which creates independent copies of all altered disk blocks.
Support for this Btrfs feature was added in version 7.5 of the GNU coreutils, via the
--reflink option to the cp command.
Cloning can be especially effective in case of storing disk images of virtual machines or their
snapshots. Those are large files differing only in small portions, where the cloning provides both
their faster (instantaneous) copying and minimal consumption of storage space due to data
deduplication.
3.3 Snapshots-
A snapshot is a read-only copy of a data set frozen at a particular point in time. Here we will
consider the case of writable snapshots of the tree structure. In btrfs, the cloning or snapshot
algorithm allows a theoretically large number of snapshots.
In the above example, we have an initial tree Tp. Here we have shown the reference count of
each block. Initially all the tree blocks have a reference count of 1. Now we clone the b-tree as
tree Tq. The root of Tq refers to the same blocks as that of Tp. Now, as there are two tree roots
referencing the common blocks B and C, the reference count of these blocks is increased by
one. So the cloning algorithm just sets pointers to the blocks referred to by the original tree root
and increases the reference count of the referenced blocks. Hence we can have as many
snapshots as we want, because a pointer occupies far less space than the actual data.
Now we consider the process of editing shared blocks. Figure 4.2 shows an example of this.
In this example, there are two tree roots, Tp and Tq. Now suppose that we are editing the
snapshot with respect to Tq, and the leaf being modified is H. Node C is the immediate ancestor
of leaf H, and it should point to the modified copy of leaf H. So block C is shadowed to C′,
which points to the same blocks as C, and the reference count of C is decremented. Then leaf H
is shadowed to H′ and the reference count of block H is decremented by one. Hence, due to this
kind of sharing, the space requirement is low compared to copying and modifying the entire
tree.
3.4 Subvolumes-
A subvolume is a volume within a volume which can be mounted separately. The user sees
subvolumes as directories. There are benefits to doing this. We can, for example, make a
database directory a subvolume, which enables taking snapshots of it for use with backups.
Unlike volumes in other file systems, however, a subvolume is not a separate block device;
when accessed through its parent, it appears as a directory under the parent directory itself.
A subvolume in Btrfs is quite different from the usual LVM logical volumes. With LVM, a
logical volume is a block device in its own right — while this is not the case with Btrfs. A Btrfs
subvolume is not a separate block device, and it cannot be treated or used that way.
Instead, a Btrfs subvolume can be thought of as a separate POSIX file namespace. This
namespace can be accessed either through the top-level subvolume of the file system, or it can be
mounted on its own and accessed separately by specifying the subvol or subvolid option
to mount. When accessed through the top-level subvolume, subvolumes are visible and accessed
as its subdirectories.
Subvolumes can be created at any place within the file system hierarchy, and they can also be
nested. Nested subvolumes appear as subdirectories within their parent subvolumes, similar to
the way the top-level subvolume presents its subvolumes as subdirectories. Deleting a subvolume
deletes all subvolumes below it in the nesting hierarchy, and for this reason the top-level
subvolume cannot be deleted.
Any Btrfs file system always has a default subvolume, which is initially set to be the top-level
subvolume, and it is mounted by default if no subvolume selection option is passed to mount. Of
course, the default subvolume can be changed as required.
3.5 Send/receive-
Given any pair of subvolumes (or snapshots), Btrfs can generate a binary diff between them
(by using the btrfs send command) that can be replayed later (by using btrfs receive), possibly on
a different Btrfs file system. The send/receive feature effectively creates (and applies) a set of
data modifications required for converting one subvolume into another.
The send/receive feature can be used with regularly scheduled snapshots for implementing a
simple form of file system master/slave replication, or for the purpose of performing incremental
backups.
3.6 Quota groups-
A quota group (or qgroup) imposes an upper limit to the space a subvolume or snapshot may
consume. A new snapshot initially consumes no quota because its data is shared with its parent,
but thereafter incurs a charge for new files and copy-on-write operations on existing files. When
quotas are active, a quota group is automatically created with each new subvolume or snapshot.
These initial quota groups are building blocks which can be grouped (with the btrfs
qgroup command) into hierarchies to implement quota pools.
Quota groups only apply to subvolumes and snapshots, while having quotas enforced on
individual subdirectories is not possible.
In-place ext2/3/4 conversion-
As the result of having very little metadata anchored in fixed locations, Btrfs can warp to fit
unusual spatial layouts of the backend storage devices. The btrfs-convert tool exploits this ability
to do an in-place conversion of any ext2/3/4 file system, by nesting the equivalent Btrfs metadata
in its unallocated space — while preserving an unmodified copy of the original file system.
The conversion involves creating a copy of the whole ext2/3/4 metadata, while the Btrfs files
simply point to the same blocks used by the ext2/3/4 files. This makes the bulk of the blocks
shared between the two filesystems before the conversion becomes permanent. Thanks to the
copy-on-write nature of Btrfs, the original versions of the file data blocks are preserved during
all file modifications. Until the conversion becomes permanent, only the blocks that were
marked as free in ext2/3/4 are used to hold new Btrfs modifications, meaning that the conversion
can be undone at any time.
All converted files are available and writable in the default subvolume of the Btrfs. A sparse file
holding all of the references to the original ext2/3/4 filesystem is created in a separate
subvolume, which is mountable on its own as a read-only disk image, allowing both original and
converted file systems to be accessed at the same time. Deleting this sparse file frees up the
space and makes the conversion permanent.
3.7 Seed devices-
When creating a new Btrfs, an existing Btrfs can be used as a read-only "seed" file system.
The new file system will then act as a copy-on-write overlay on the seed. The seed can be later
detached from the Btrfs, at which point the rebalancer will simply copy over any seed data still
referenced by the new file system before detaching. Mason has suggested this may be useful for
a Live CD installer, which might boot from a read-only Btrfs seed on optical disc, rebalance itself
to the target partition on the install disk in the background while the user continues to work, then
eject the disc to complete the installation without rebooting.
3.8 Encryption-
Though Chris Mason said in his interview in 2009 that encryption was planned for Btrfs, this
is unlikely to be implemented for some time, if ever, due to the complexity of implementation
and pre-existing tested and peer-reviewed solutions. The current recommendation for encryption
with Btrfs is to use a full-disk encryption mechanism such as dm-crypt/LUKS on the underlying
devices, and to create the Btrfs filesystem on top of that layer. (If RAID is to be used with
encryption, encrypting a dm-raid device or a hardware-RAID device gives much faster disk
performance than layering Btrfs' own filesystem-level RAID features over dm-crypt.)
3.9 Checking and recovery-
Unix systems traditionally rely on "fsck" programs to check and repair filesystems.
The btrfsck program is now available but, as of May 2012, it is described by the authors as
"relatively new code" which has "not seen widespread testing on a large range of real-life
breakage" and "may cause additional damage in the process of repair".
There is another tool, named btrfs-restore, that can be used to recover files from an unmountable
filesystem, without modifying the broken filesystem itself (i.e., non-destructively).
In normal use, Btrfs is mostly self-healing and can recover from broken root trees at mount time,
thanks to making periodic data flushes to permanent storage every 30 seconds (which is the
default period). Thus, isolated errors will cause a maximum of 30 seconds of filesystem changes
to be lost at the next mount. This period can be changed by specifying a desired value (in
seconds) for the commit mount option.
4- Design
Ohad Rodeh's original proposal at USENIX 2007 noted that B+ trees, which are widely used
as on-disk data structures for databases, could not efficiently support copy-on-write-based
snapshots because their leaf nodes were linked together: if a leaf were copied on write, its
siblings and parents would have to be as well, as would their siblings and parents and so on
until the entire tree was copied. He suggested instead a modified B-tree (which has no leaf
linkage), with
a refcount associated to each tree node but stored in an ad-hoc free map structure and certain
relaxations to the tree's balancing algorithms to make them copy-on-write friendly. The result
would be a data structure suitable for a high-performance object store that could perform copy-
on-write snapshots, while maintaining good concurrency.
At Oracle later that year, Chris Mason began work on a snapshot-capable file system that
would use this data structure almost exclusively—not just for metadata and file data, but also
recursively to track space allocation of the trees themselves. This allowed all traversal and
modifications to be funneled through a single code path, against which features such as copy-on-
write, checksumming and mirroring needed to be implemented only once to benefit the entire file
system.
Btrfs is structured as several layers of such trees, all using the same B-tree implementation.
The trees store generic items sorted on a 136-bit key. The first 64 bits of the key are a
unique object id. The middle 8 bits are an item type field; its use is hardwired into code as an
item filter in tree lookups. Objects can have multiple items of multiple types. The remaining
right-hand 64 bits are used in type-specific ways. Therefore items for the same object end up
adjacent to each other in the tree, ordered by type. By choosing certain right-hand key values,
objects can further put items of the same type in a particular order.
Interior tree nodes are simply flat lists of key-pointer pairs, where the pointer is the logical
block number of a child node. Leaf nodes contain item keys packed into the front of the node and
item data packed into the end, with the two growing toward each other as the leaf fills up.
In this section, we will cover the basic data structures that are used in btrfs. Every tree block
is either a leaf or a node. Every leaf and node begins with a header.
// Node:
struct btrfs_node {
    struct btrfs_header header;
    struct btrfs_key_ptr ptrs[];
};

// Leaf:
struct btrfs_leaf {
    struct btrfs_header header;
    struct btrfs_item items[];
};

// Header (present in both node and leaf):
struct btrfs_header {
    u8 csum[32];
    u8 fsid[16];
    __le64 blocknr;
    __le64 generation;
    __le64 owner;
    __le16 nritems;
    __le16 flags;
    u8 level;
};

// Key pointers (present in node):
struct btrfs_key_ptr {
    struct btrfs_disk_key key;
    __le64 blockptr;
    __le64 generation;
};

// Items (present in leaf):
struct btrfs_item {
    struct btrfs_disk_key key;
    __le32 offset;
    __le32 size;
};

// Key (stored in the items in the leaf):
struct btrfs_key {
    u64 objectid;
    u32 flags;
    u64 offset;
};
Every tree block carries the header. The block header contains a checksum for the block
contents, the uuid of the filesystem that owns the block, the level of the block in the tree, and
the block number where this block is supposed to live. These fields allow the contents of the
metadata to be verified when the data is read. The generation field corresponds to the
transaction id that allocated the block. Nodes have a pointer array which points to other leaves
or nodes (i.e., blocks on disk) using the blockptr field in the key pointer.
Now we will look at more details of the leaf structure. A leaf node contains the header and
an array of items. Each logical object in the file system (e.g., files and directories) comprises
various items. The b-tree implementation stores these items in the leaf sorted on a 136-bit key
(struct btrfs-key). The first 64 bits of the key are an objectid, a unique id for each logical object.
This id is reported as the inode number. Types of items in a leaf can be inodes, directory
entries, extents and so on, associated with the object.
The next field in the btrfs-key is the type, which gives information about the type of the item
associated with the object. The next field in the key is the offset, which tells the position of the
item's data in the leaf. Now, the interesting thing is that as the objectid forms the MSB of the
btrfs-key, all items related to an object end up adjacent to each other, i.e., they are automatically
grouped together. This means metadata, and optionally data, associated with an object is
grouped together, which results in compact packing of the data and metadata. If there are N
items in the leaf, then the data item associated with item[X] is data-item[N-X]. This means the
items and the data associated with the items grow towards each other in the leaf. Figure 3.1
("Leaf structure in Btrfs") summarizes this paragraph.
Now let's look at some information about the disk layout. The scheme of storing the items
associated with an object together in a leaf is both space and time efficient. Normally, file
systems put only one kind of data - bitmaps, or inodes, or directory entries - in any given file
system block. This wastes disk space, since unused space in one kind of block can't be used for
any other purpose, and it wastes time, since getting to one particular piece of file data requires
reading several different kinds of metadata, all located in different blocks in the file system. In
btrfs, items are packed together (or pushed out to leaves) in arrangements that optimize both
access time and disk space. You can see the difference in these (very schematic, very
simplified) diagrams.
Old-school filesystems tend to organize data as shown in the first diagram below; Btrfs instead
creates a disk layout which looks as shown in the second. As we can see, there is no fixed block
for the inodes, bitmaps, directory entries, file data or block pointers. The blocks associated with
these can overlap for the sake of compaction. The red arrows in the figures show the disk seeks
needed to locate data or metadata. The red portions of the blocks show unused or wasted space.
As all metadata related to an object is closely packed, there are fewer disk seeks. Hence the
scheme is time and space efficient.
There are various b-trees in btrfs. Everything is stored in these b-trees, and there is a single
body of tree-manipulation code. The trees do not care about the object types stored in the
b-tree, so the same code can be reused for all kinds of trees in btrfs. Hence the scheme is not
only space and time efficient, but code efficient too.
Data organization in old-school filesystems
Data organization in Btrfs
4.1 Root tree-
Every tree appears as an object in the root tree (or tree of tree roots). Some trees, such as file
system trees and log trees, have a variable number of instances, each of which is given its own
object id. Trees which are singletons (the data relocation, extent and chunk trees) are assigned
special, fixed object ids ≤256. The root tree appears in itself as a tree with object id 1.
Trees refer to each other by object id. They may also refer to individual nodes in other trees as a
triplet of the tree's object id, the node's level within the tree and its leftmost key value. Such
references are independent of where the tree is actually stored.
4.2 File system tree-
User-visible files and directories all live in a file system tree. There is one file system tree per
subvolume. Subvolumes can nest, in which case they appear as a directory item (described
below) whose data is a reference to the nested subvolume's file system tree.
Within the file system tree, each file and directory object has an inode item. Extended
attributes and ACL entries are stored alongside in separate items.
Within each directory, directory entries appear as directory items, whose right-hand key
values are a CRC32C hash of their filename. Their data is a location key, or the key of the inode
item it points to. Directory items together can thus act as an index for path-to-inode lookups, but
are not used for iteration because they are sorted by their hash, effectively randomly
permuting them. This means user applications iterating over and opening files in a large
directory would thus generate many more disk seeks between non-adjacent files—a notable
performance drain in other file systems with hash-ordered directories such as ReiserFS, ext3
(with Htree-indexes enabled and ext4, all of which have TEA-hashed filenames. To avoid this,
each directory entry has a directory index item, whose right-hand value of the item is set to a per-
directory counter that increments with each new directory entry. Iteration over these index items
thus returns entries in roughly the same order as they are stored on disk.
Besides inode items, files and directories also have a reference item whose right-hand key
value is the object id of their parent directory. The data part of the reference item is the filename
that inode is known by in that directory. This allows upward traversal through the directory
hierarchy by providing a way to map inodes back to paths.
Files with hard links in other directories have multiple reference items, one for each parent
directory. Files with hard links in the same directory pack all of the links' filenames into the
same reference item. This was a design flaw that limited the number of same-directory hard links
to however many could fit in a single tree block. (With the default block size of 4 KB, an average
filename length of 8 bytes and a per-filename header of 4 bytes, this would be less than 350.)
Applications that made heavy use of same-directory hard links, such
as git, GNUS, GMame and BackupPC, were later observed to fail after hitting this limit. The limit
was eventually removed (and as of October 2012 had been merged pending release in Linux) by
introducing spillover extended reference items to hold hard-link filenames which could not
otherwise fit.
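The back-of-the-envelope arithmetic behind the "less than 350" figure quoted above works out as follows, using the stated 4 KB block, 8-byte average filename and 4-byte per-filename header:

```python
# Each same-directory hard link costs roughly one filename plus its header.
block_size = 4096        # default tree block size, bytes
avg_filename = 8         # average filename length, bytes
per_name_header = 4      # per-filename header, bytes

max_links = block_size // (avg_filename + per_name_header)  # 341
```

At 12 bytes per link, a single 4 KB block holds 341 links, consistent with "less than 350".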
4.3 Relocation trees-
Defragmentation, shrinking and rebalancing operations require extents to be relocated. However,
doing a simple copy-on-write of the extent being relocated would break sharing between snapshots and
consume disk space. To preserve sharing, an update-and-swap algorithm is used, with a
special relocation tree serving as scratch space for affected metadata. The extent to be relocated
is first copied to its destination. Then, by following back references upward through the affected
subvolume's file system tree, metadata pointing to the old extent is progressively updated to
point at the new one; any newly updated items are stored in the relocation tree. Once the update
is complete, items in the relocation tree are swapped with their counterparts in the affected
subvolume, and the relocation tree is discarded.
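The update-and-swap control flow can be sketched as below. This is an illustration of the algorithm's shape only, not Btrfs's actual data structures: a dict stands in for the subvolume's tree, another dict for the relocation tree, and extents are just string labels.

```python
# Items of a subvolume's file system tree, mapped to the extent they reference.
fs_tree = {"item_a": "old_extent", "item_b": "old_extent", "item_c": "other"}

def relocate(tree, old, new):
    """Update-and-swap relocation of `old` to `new` (simplified sketch)."""
    relocation_tree = {}                    # scratch space for updated items
    # Step 1 (not modelled here): copy the extent's data to its destination.
    # Step 2: walk items referencing the old extent and record updated
    # versions in the relocation tree.
    for item, extent in tree.items():
        if extent == old:
            relocation_tree[item] = new
    # Step 3: swap updated items into the subvolume's tree; the relocation
    # tree is then discarded.
    tree.update(relocation_tree)

relocate(fs_tree, "old_extent", "new_extent")
```

After the swap, every item that pointed at the old extent points at the new one, while unrelated items are untouched.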
Figure: (a) File system forest; (b) the changes that occur after modification.
4.4 Extents-
File data is kept outside the tree in extents, which are contiguous runs of disk blocks. Extent
blocks default to 4 KiB in size, have no headers, and contain only (possibly compressed) file
data. In compressed extents, individual blocks are not compressed separately; rather, the
compression stream spans the entire extent.
Files have extent data items to track the extents which hold their contents. The item's right-hand
key value is the starting byte offset of the extent. This makes for efficient seeks in large files
with many extents, because the correct extent for any given file offset can be computed with just
one tree lookup.
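The one-lookup seek described above amounts to finding the item with the greatest starting offset not exceeding the target file offset. A minimal sketch, with a sorted list standing in for the B-tree and invented extent boundaries:

```python
import bisect

# Right-hand key values of the extent data items: each extent's starting
# byte offset within the file (values are illustrative).
extent_starts = [0, 4096, 20480]
extent_items = ["extent-0", "extent-1", "extent-2"]

def extent_for(offset):
    """Find the extent covering `offset`: the item with the greatest
    starting offset <= offset (one ordered lookup)."""
    i = bisect.bisect_right(extent_starts, offset) - 1
    return extent_items[i]
```

In a B-tree, `bisect_right` corresponds to a single keyed descent, so the cost is one tree lookup regardless of how many extents the file has.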
Snapshots and cloned files share extents. When a small part of such a large extent is overwritten,
the resulting copy-on-write may create three new extents: a small one containing the overwritten
data, and two large ones with unmodified data on either side of the overwrite. To avoid having to
re-write unmodified data, the copy-on-write may instead create bookend extents: extents
which are simply slices of existing extents. Extent data items allow for this by including an offset
into the extent they are tracking: items for bookends are those with non-zero offsets.
If the file data is small enough to fit inside a tree node, it is instead pulled in-tree and stored
inline in the extent data item. Each tree node is stored in its own tree block—a single
uncompressed block with a header. The tree block is regarded as a free-standing, single-block
extent.
5- Performance (Disk)
6- Limitations
Here we list a few of the problems still to be addressed.
1. Transactions
(a) Btrfs supports only limited transactions, without Atomicity-Consistency-Isolation-Durability
(ACID) semantics.
(b) Only one transaction may run at a time, and it is not atomic with respect to storage.
2. Checking and recovery
(a) An fsck tool is available, but its use is not yet recommended.
7-Future Development
Some of the planned features are:
1. Encryption
2. Data deduplication
3. Parity-based RAID (RAID 5 and RAID 6)
4. Support for swap files
5. Incremental dumps
8- References
[1] Ohad Rodeh, "B-trees, Shadowing, and Clones", ACM Transactions on Storage, 2008.
[2] Kerner, Sean Michael, "A Better File System for Linux", InternetNews.com. Archived from
the original on 24 June 2012. Retrieved 2008-10-30.
[4] Valerie Aurora, "A short history of btrfs", 2011.
[5] Macedonia, M. R., "B-tree file system", IEEE Communications Magazine, vol. 47, pp. S30-S38,
Mar 2007.
[6] Mason, Chris, "Btrfs: a copy on write, snapshotting FS", 2007-06-12.
[7] Brown, Eric, "Linux 3.0 scrubs up Btrfs, gets more Xen", LinuxDevices (eWeek). Archived
from the original on 2013-01-27. Retrieved 8 November 2011.