FreeBSD/ZFS - last word in operating/file systems
TRANSCRIPT
![Page 2: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/2.jpg)
The beginning...
• ZFS released by Sun under the CDDL license
• available in Solaris / OpenSolaris only
• ongoing Linux port for the FUSE framework (userland); started as a SoC project
• ongoing port for Mac OS X (read-only support in Leopard)
![Page 3: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/3.jpg)
Features...
• ZFS has many very interesting features, which make it one of the most wanted file systems
![Page 4: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/4.jpg)
Features...
• dynamic striping – use the entire bandwidth available
• RAID-Z (RAID-5 without the “write hole”; more like RAID-3, actually)
• RAID-1
• 128 bits (POSIX limits FS to 64 bits)... (think about 65 bits)
![Page 5: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/5.jpg)
Features...
• pooled storage
  • no more volumes/partitions
  • does for storage what VM did for memory
• copy-on-write model
• transactional operation
  • always consistent on disk
  • no fsck, no journaling
• intelligent synchronization (resilvering)
  • synchronize only valid data
![Page 6: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/6.jpg)
Features...
• snapshots
  • very cheap, because of COW model
• clones
  • writable snapshots
• snapshot rollback
  • very handy “undo” operation
• end-to-end data integrity
  • detects and corrects silent data corruption caused by any defect in disk, cable, controller, driver or firmware
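Why snapshots, clones and rollback come almost for free under a copy-on-write model can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions: the class `CowFileSystem` and its methods are hypothetical names, a Python dict stands in for the on-disk block tree, and each write copies the whole block map, whereas a real implementation would copy only the path from the changed block up to the root.

```python
class CowFileSystem:
    """Toy copy-on-write store (hypothetical): block number -> bytes."""

    def __init__(self):
        self.live = {}        # current block map (the "root")
        self.snapshots = {}   # snapshot name -> block map frozen at that time

    def write(self, blkno, data):
        # Never modify the current map in place: build a new one and switch
        # to it. Any older map that a snapshot refers to stays untouched.
        new_map = dict(self.live)
        new_map[blkno] = bytes(data)
        self.live = new_map

    def snapshot(self, name):
        # Cheap: only a reference to the current map is kept, no data copied.
        self.snapshots[name] = self.live

    def rollback(self, name):
        # "Undo": make the snapshot's map the current one again.
        self.live = self.snapshots[name]

    def clone(self, name):
        # A writable snapshot: a new file system starting from the same map.
        other = CowFileSystem()
        other.live = self.snapshots[name]
        return other


fs = CowFileSystem()
fs.write(0, b"monday data")
fs.snapshot("monday")
fs.write(0, b"tuesday data")        # the snapshot still sees the old block
clone = fs.clone("monday")
clone.write(1, b"only in clone")    # does not touch the snapshot either
fs.rollback("monday")
```

After the rollback, block 0 of `fs` holds the Monday contents again, while the clone keeps its own extra block.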
![Page 7: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/7.jpg)
Features...
• built-in compression
  • lzjb, gzip
• self-healing
  • return good data and fix corrupted data
• endian-independent
  • always write in native endianness
• simplified administration
• per-filesystem encryption
  • work in progress
![Page 8: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/8.jpg)
Features...
• delegated administration
  • user-administrable file systems
• administration from within a zone
  • from within a jail in FreeBSD
![Page 9: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/9.jpg)
FS/Volume model vs. ZFS
[Diagram: three file systems (FS), each on its own Volume, next to three ZFS file systems sharing one Storage Pool]
● Traditional Volumes
  ● abstraction: virtual disk
  ● volume/partition for each FS
  ● grow/shrink by hand
  ● each FS has limited bandwidth
  ● storage is fragmented
● ZFS Pooled Storage
  ● abstraction: malloc/free
  ● no partitions to manage
  ● grow/shrink automatically
  ● all bandwidth always available
  ● all storage in the pool is shared
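The malloc/free analogy can be made concrete with a small model. This is a hedged sketch, not ZFS code: `StoragePool` and its methods are hypothetical names, and space is counted in abstract blocks. File systems do not get a fixed slice; they allocate from, and release to, one shared pool.

```python
import errno


class StoragePool:
    """Toy pool (hypothetical): file systems share one set of free blocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = {}              # file system name -> blocks in use

    def create_fs(self, name):
        # Creating a file system reserves nothing up front.
        self.used[name] = 0

    def free(self):
        return self.capacity - sum(self.used.values())

    def alloc(self, name, nblocks):
        # Like malloc: succeeds as long as the pool as a whole has room.
        if nblocks > self.free():
            raise OSError(errno.ENOSPC, "pool is full")
        self.used[name] += nblocks

    def release(self, name, nblocks):
        # Like free: the blocks become available to every file system.
        self.used[name] -= min(nblocks, self.used[name])


pool = StoragePool(capacity=300)
for name in ("home", "src", "obj"):
    pool.create_fs(name)

# With three fixed 100-block volumes this request could not be satisfied;
# with a shared 300-block pool it can.
pool.alloc("src", 150)
```

The design point is that growing or shrinking a file system is just bookkeeping against the pool, which is why no partition has to be resized by hand.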
![Page 10: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/10.jpg)
ZFS Self-Healing
![Page 11: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/11.jpg)
Traditional mirroring
[Diagram, repeated for each step: Application, File System, xVM mirror]
1. Application issues a read. The mirror reads the first disk, which has a corrupt block. It can't tell...
2. The volume manager passes the bad block to the file system. If it's a metadata block, the system panics. If not...
3. The file system returns bad data to the application...
![Page 12: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/12.jpg)
Self-Healing data in ZFS
[Diagram, repeated for each step: Application, ZFS mirror]
1. Application issues a read. ZFS mirror tries the first disk. Checksum reveals that the block is corrupt on disk.
2. ZFS tries the second disk. Checksum indicates that the block is good.
3. ZFS returns good data to the application and repairs the damaged block.
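The three steps of a self-healing read can be sketched in a few lines. This is an illustrative model under stated assumptions, not the ZFS implementation: the `Mirror` class is hypothetical, each "disk" is a dict, SHA-256 merely stands in for whichever checksum is configured, and the checksums are kept apart from the data blocks so that a damaged block cannot vouch for itself.

```python
import hashlib


def checksum(data):
    # Stand-in checksum for this sketch.
    return hashlib.sha256(data).digest()


class Mirror:
    """Toy checksummed mirror (hypothetical)."""

    def __init__(self, ndisks=2):
        self.disks = [dict() for _ in range(ndisks)]
        self.checksums = {}         # block number -> expected checksum

    def write(self, blkno, data):
        for disk in self.disks:
            disk[blkno] = data
        self.checksums[blkno] = checksum(data)

    def read(self, blkno):
        expected = self.checksums[blkno]
        damaged = []
        for disk in self.disks:
            data = disk.get(blkno)
            if data is not None and checksum(data) == expected:
                # Good copy found: repair every copy that failed before it.
                for bad_disk in damaged:
                    bad_disk[blkno] = data
                return data
            damaged.append(disk)
        raise OSError("no copy of block %d passes its checksum" % blkno)


mirror = Mirror()
mirror.write(7, b"important data")
mirror.disks[0][7] = b"silently corrupt"    # simulate on-disk corruption
result = mirror.read(7)                      # good data, and disk 0 is fixed
```

A traditional mirror has no checksum to consult, so in the same scenario it would return whatever the first disk held.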
![Page 13: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/13.jpg)
Porting...
• very portable code (started to work after 10 days (and nights) of porting)
• few ugly Solaris-specific details
• few ugly FreeBSD-specific details (VFS, buffer cache)
• ZPL (ZFS POSIX layer) was hell; yes, this is the thing which VFS talks to
![Page 14: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/14.jpg)
Solaris compatibility layer
contrib/opensolaris/ - userland code taken from OpenSolaris used by ZFS (ZFS control utilities, libraries, test tools)
compat/opensolaris/ - userland API compatibility layer (Solaris-specific functions missing in FreeBSD)
cddl/ - Makefiles used to build userland libraries and utilities
sys/contrib/opensolaris/ - kernel code taken from OpenSolaris used by ZFS
sys/compat/opensolaris/ - kernel API compatibility layer
sys/modules/zfs/ - Makefile for building ZFS kernel module
![Page 15: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/15.jpg)
ZFS connection points in the kernel
[Diagram: ZFS in the middle, connected to:]
• GEOM (ZVOL)
• VFS (ZFS file systems)
• /dev/zfs (userland communication)
• GEOM (VDEV)
![Page 16: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/16.jpg)
How does it look exactly...
[Layer diagram; labels: ZVOL/GEOM (providers only), VDEV_GEOM (consumers only), VDEV_FILE, VDEV_DISK, GEOM, VFS, ZPL, ZFS, many other layers]
![Page 17: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/17.jpg)
Snapshots
• contains @ in its name:
  # zfs list
  NAME                  USED   AVAIL  REFER  MOUNTPOINT
  tank                  50,4M  73,3G  50,3M  /tank
  tank@monday           0      -      50,3M  -
  tank@tuesday          0      -      50,3M  -
  tank/freebsd          24,5K  73,3G  24,5K  /tank/freebsd
  tank/freebsd@tuesday  0      -      24,5K  -
• mounted on first access under /mountpoint/.zfs/snapshot/<name>
• hard to NFS-export
  • separate file systems have to be visible when their parent is NFS-mounted
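The naming rule and the mount location described on this slide can be combined into a small helper. This is a sketch only: `is_snapshot` and `snapshot_dir` are hypothetical names, and the mountpoint is passed in by the caller (for example taken from the MOUNTPOINT column of `zfs list`) rather than looked up.

```python
import posixpath


def is_snapshot(dataset):
    # A snapshot is recognizable by the "@" in its name.
    return "@" in dataset


def snapshot_dir(mountpoint, dataset):
    """Return the directory under which a snapshot appears on first access."""
    filesystem, sep, snapname = dataset.partition("@")
    if not sep or not snapname:
        raise ValueError("not a snapshot name: %r" % dataset)
    return posixpath.join(mountpoint, ".zfs", "snapshot", snapname)


path = snapshot_dir("/tank/freebsd", "tank/freebsd@tuesday")
```

For the listing above, `path` is `/tank/freebsd/.zfs/snapshot/tuesday`.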
![Page 18: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/18.jpg)
NFS is easy
# mountd /etc/exports /etc/zfs/exports
# zfs set sharenfs=ro,maproot=0,network=192.168.0.0,mask=255.255.0.0 tank
# cat /etc/zfs/exports
# !!! DO NOT EDIT THIS FILE MANUALLY !!!

/tank -ro -maproot=0 -network=192.168.0.0 -mask=255.255.0.0
/tank/freebsd -ro -maproot=0 -network=192.168.0.0 -mask=255.255.0.0

• we translate options to exports(5) format and SIGHUP mountd(8) daemon
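The translation from a `sharenfs` value to exports(5) lines can be sketched as follows. This is an assumption-laden illustration, not the code used in the port: `exports_line` and `exports_file` are hypothetical names, the separator is a single space as printed on the slide, and every comma-separated option is simply prefixed with a dash.

```python
def exports_line(mountpoint, sharenfs):
    """Turn 'ro,maproot=0,...' into '/mountpoint -ro -maproot=0 ...'."""
    options = ["-" + opt for opt in sharenfs.split(",") if opt]
    return " ".join([mountpoint] + options)


def exports_file(mountpoints, sharenfs):
    """Build the whole file for file systems that share one sharenfs value."""
    lines = ["# !!! DO NOT EDIT THIS FILE MANUALLY !!!", ""]
    lines += [exports_line(mp, sharenfs) for mp in mountpoints]
    return "\n".join(lines) + "\n"


share = "ro,maproot=0,network=192.168.0.0,mask=255.255.0.0"
text = exports_file(["/tank", "/tank/freebsd"], share)
```

With the slide's property value, `text` holds the header followed by one line each for `/tank` and `/tank/freebsd`, matching the `cat /etc/zfs/exports` output shown above. Signalling mountd(8) afterwards is left out of the sketch.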
![Page 19: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/19.jpg)
Missing bits in FreeBSD needed by ZFS
![Page 20: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/20.jpg)
Sleepable mutexes
• no sleeping while holding mutex(9)
• Solaris mutexes implemented on top of sx(9) locks (performance improvements by Attilio Rao)
• condvar(9) version that operates on any locks, not only mutexes (implemented by John Baldwin)
![Page 21: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/21.jpg)
GFS (Generic Pseudo-Filesystem)
• allows creating “virtual” objects (not stored on disk)
• in ZFS we have:
  .zfs/
  .zfs/snapshot
  .zfs/snapshot/<name>/
![Page 22: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/22.jpg)
VPTOFH
• translates vnode to a file handle
• VFS_VPTOFH(9) replaced with VOP_VPTOFH(9) to support NFS exporting of GFS vnodes
• it's just better that way – confirmed by Kirk McKusick
![Page 23: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/23.jpg)
lseek(2) SEEK_{DATA,HOLE}
• SEEK_HOLE – returns the offset of the next hole
• SEEK_DATA – returns the offset of the next data
• helpful for backup software
• not ZFS-specific
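How backup software might use these two `whence` values can be shown with a short walk over a file. This is a sketch, not taken from the talk: `data_extents` is a hypothetical helper, it relies on Python exposing `os.SEEK_DATA` and `os.SEEK_HOLE` on platforms that support them, and it falls back to treating the whole file as data where they are missing.

```python
import errno
import os


def data_extents(path):
    """Return [(start, end), ...] byte ranges of the file that hold data."""
    size = os.path.getsize(path)
    if not hasattr(os, "SEEK_DATA"):
        # Assumption for this sketch: without support, treat it all as data.
        return [(0, size)] if size else []

    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while offset < size:
            try:
                # Offset of the next data at or after `offset`.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:    # only a hole is left
                    break
                raise
            # Offset of the next hole after that data (end of file counts).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return extents
```

A backup tool would then read and store only the returned ranges and recreate the holes on restore. How finely holes are reported depends on the file system, so the exact ranges for a sparse file vary between systems.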
![Page 24: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/24.jpg)
Integration with jails
• ZFS nicely integrates with zones on Solaris, so why not use it with FreeBSD's jails?
• pools can only be managed from outside a jail
• zfs file systems can be managed from within a jail
![Page 25: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/25.jpg)
Integration with jails
main# zpool create tank mirror da0 da1
main# zfs create -o jailed=on tank/jail
main# jail hostname /jail/root 10.0.0.1 /bin/tcsh
main# zfs jail <id> tank/jail

jail# zfs create tank/jail/home
jail# zfs create tank/jail/home/pjd
jail# zfs snapshot tank/jail/home@today
![Page 26: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/26.jpg)
Testing correctness
• ztest (libzpool)
  • “a product is only as good as its test suite”
  • runs most of the ZFS code in userland
  • probably more abuse in 20 seconds than you'd see in a lifetime
• fstest regression test suite
  • 3438 tests in 184 files
  • # prove -r /usr/src/tools/regression/fstest/tests
  • tests: chflags(2), chmod(2), chown(2), link(2), mkdir(2), mkfifo(2), open(2), rename(2), rmdir(2), symlink(2), truncate(2), unlink(2)
![Page 27: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/27.jpg)
Performance
![Page 28: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/28.jpg)
Before showing the numbers...
• a lot has been done in this area
  • the buffer cache bypass
  • new sx(9) implementation
  • namecache
  • shared vnode locking
  • mmap(2) fixes
![Page 29: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/29.jpg)
Untarring src.tar four times one by one
[Bar chart: UFS+SU vs. ZFS; y-axis: time in seconds (less is better), 0 to 220]
![Page 30: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/30.jpg)
Removing four src directories one by one
[Bar chart: UFS+SU vs. ZFS; y-axis: time in seconds (less is better), 0 to 100]
![Page 31: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/31.jpg)
Untarring src.tar four times in parallel
[Bar chart: UFS+SU vs. ZFS; y-axis: time in seconds (less is better), 0 to 350]
![Page 32: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/32.jpg)
Removing four src directories in parallel
[Bar chart: UFS+SU vs. ZFS; y-axis: time in seconds (less is better), 0 to 200]
![Page 33: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/33.jpg)
5GB of sequential write
[Bar chart: UFS+SU vs. ZFS; y-axis: time in seconds (less is better), 0 to 100]
![Page 34: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/34.jpg)
4 x 2GB of sequential writes in parallel
[Bar chart: UFS+SU vs. ZFS; y-axis: time in seconds (less is better), 0 to 250]
![Page 35: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/35.jpg)
fsx -N 50000 (operating on mmap(2)ed files)
[Bar chart: UFS+SU vs. ZFS; y-axis: time in seconds (less is better), 0 to 50]
![Page 36: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/36.jpg)
Changes after initial commit
• rc.d/zfs startup script (by des@)
• periodic zfs script (by des@)
• support for all architectures
• jails support
• reports via devd(8)
• root on ZFS
• hostid
• disk identifiers
• use of FreeBSD's namecache
![Page 37: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/37.jpg)
Changes after initial commit
• performance improvements; based on help/work from ups@, jhb@, kris@, attilio@
• many bug fixes; based on feedback from the FreeBSD community
![Page 38: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/38.jpg)
Changes in the pipeline
• extended attributes based on Solaris' fsattr(5)
• delegated administration
• ZFS boot
![Page 39: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/39.jpg)
Future changes
• POSIX.1e ACLs based on extended attributes
• NFSv4-style ACLs
• iSCSI support for ZVOLs
• ZFS configuration at installation time
![Page 40: FreeBSD/ZFS - last word in operating/file systems](https://reader033.vdocuments.net/reader033/viewer/2022051400/553e1074550346724a8b486e/html5/thumbnails/40.jpg)
Some examples...