
FS2: Dynamic Data Replication in Free Disk Space for Improving Disk Performance and Energy Consumption

Hai Huang, Wanda Hung, and Kang G. Shin
Dept. of Computer Science and Electrical Engineering
The University of Michigan
Ann Arbor, MI 48105, USA

[email protected], [email protected], [email protected]

ABSTRACT

Disk performance is increasingly limited by its head positioning latencies, i.e., seek time and rotational delay. To reduce the head positioning latencies, we propose a novel technique that dynamically places copies of data in the file system's free blocks according to the disk access patterns observed at runtime. As one or more replicas can now be accessed in addition to their original data block, choosing the "nearest" replica that provides the fastest access can significantly improve performance for disk I/O operations.

We implemented and evaluated a prototype based on the popular Ext2 file system. In our prototype, since the file system layout is modified only by using the free/unused disk space (hence the name Free Space File System, or FS2), users are completely oblivious to how the file system layout is modified in the background; they will only notice performance improvements over time. For a wide range of workloads running under Linux, FS2 is shown to reduce disk access time by 41–68% (as a result of a 37–78% shorter seek time and a 31–68% shorter rotational delay), yielding a 16–34% overall user-perceived performance improvement. The reduced disk access time also leads to a 40–71% energy savings per access.

Categories and Subject Descriptors

H.3.2 [File Organization]: Information Storage; H.3.4 [Performance Evaluation]: Systems and Software

General Terms

Management, Experimentation, Measurement, Performance

Keywords

Data replication, free disk space, dynamic file system, disk layout reorganization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP'05, October 23–26, 2005, Brighton, United Kingdom.
Copyright 2005 ACM 1-59593-079-5/05/0010 ...$5.00.

1. INTRODUCTION

Various aspects of magnetic disk technology, such as recording density, cost, and reliability, have improved significantly over the years. However, due to the slowly-improving mechanical positioning components (i.e., arm assembly and rotating platters) of the disk, its performance is falling behind the rest of the system and has become a major bottleneck to overall system performance.

To reduce these mechanical delays, traditional file systems tend to place objects that are likely to be accessed together close to one another on the disk. For example, the BSD UNIX Fast File System (FFS) [12] uses the concept of a cylinder group (one or more physically-adjacent cylinders) to cluster related data blocks and their meta-data together. For some workloads, this can significantly improve both disk latency and throughput. Unfortunately, for many other workloads in which the disk access pattern deviates from that assumed by the file system, performance can degrade significantly. This problem occurs because the file system cannot take runtime access patterns into account when making disk layout decisions, and this inability to use observed disk access patterns for data placement leads to poor disk utilization.

1.1 Motivating Example

Using a simple example, we now show how traditional file systems can sometimes perform poorly even for common workloads. In this example, we monitored disk accesses during the execution of a CVS update command in a local CVS directory containing the Linux 2.6.7 kernel source tree. As Figure 1(a) shows, almost all accesses are concentrated in two narrow regions of the disk, corresponding to the locations at which the local CVS directory and the central CVS repository are placed by the file system on the disk. By observing file directory structures, the file system does well in clustering statically-related files. Unfortunately, when accesses to files from multiple regions are interleaved, the disk head has to move back and forth constantly between the different regions (as shown in Figure 1(b)), resulting in poor disk performance despite the file system's best-effort static placement. Consequently, users will experience longer delays and disks will consume more energy. In our experiment, the CVS update command took 33 seconds to complete on an Ext2 file system with an anticipatory I/O scheduler. By managing the disk layout dynamically, the update command took only 22 seconds to complete in our implementation of FS2.

Figure 1: Part (a) shows disk sectors that were accessed when executing a cvs -q update command within a CVS local directory containing the Linux 2.6.7 source code. Part (b) shows the disk head movement (disk sector number vs. runtime in msec) within a 1-second window of the disk trace shown in part (a).

Traditional file systems perform well for most types of workloads and not so well for others. We must, therefore, first identify the types of workloads and environments for which traditional file systems generally perform poorly (listed below), and then develop good performance-improving solutions for them.

Shared systems: File servers, mail servers, and web servers are all examples of a shared system, on which multiple users are allowed to work and consume system resources independently of each other. Problems arise when multiple users are concurrently accessing the storage system (whether it is a single disk or an array of disks). As far as the file system is concerned, files of one particular user are completely independent of those of other users, and therefore, are unlikely to be stored close to those of other users on disk. Thus, concurrent disk accesses made by different users can cause the storage system to become heavily multiplexed (and therefore less efficiently utilized) between several statically-unrelated, and potentially-distant, disk regions. A PC is also a shared system with similar problems; its disk, instead of being shared among multiple users, is shared among concurrently-executing processes.

Database systems: In database systems, data and data indices are often stored as large files on top of existing file systems. However, as file systems are well-known to be inefficient at storing large files that are not sequentially accessed (e.g., transaction processing), performance can be severely limited by the underlying file system. Therefore, many commercial database systems [15, 22] manage the disk layout themselves to achieve better performance; however, this considerably increases the complexity of their implementation. For this reason, many database systems [31, 40, 15, 22] still rely on file systems for data storage and retrieval.

Systems using shared libraries: Shared libraries have been used extensively by most OSs as they can significantly reduce the size of executable binaries on disk and in memory [32]. As large applications are increasingly dependent on shared libraries, placing shared libraries closer to application images on disk can result in a smaller load time. However, since multiple applications can share the same set of libraries, determining where on disk to place these shared libraries can be problematic for existing file systems, as positioning a library closer to one application may increase its distance to others that make use of it.

1.2 Dynamic Management of Disk Layout

Many components of an operating system—those responsible for networking, memory management, and CPU scheduling—will re-tune their internal policies/algorithms as system state changes so they can most efficiently utilize available resources. Yet the file system, which is responsible for managing a slow hardware device, is often allowed to operate inefficiently for the entire life span of the system, relying only on very simple heuristics based on static information. We argue that, like other parts of an OS, the file system should have its own runtime component, allowing it to detect and compensate for poorly-placed data blocks. To achieve this, various schemes [36, 44, 2, 16] have been proposed in the past. Unfortunately, these schemes have several major drawbacks, which have limited their usefulness mostly to the research community. Below, we summarize these drawbacks and discuss our solutions.

• In previous approaches, when the disk layout is modified to improve performance, the most frequently-accessed data are always shuffled toward the middle of the disk. However, as frequently-accessed data may change over time, this would require re-shuffling the layout every time the disk access pattern changes. We will show later that this re-shuffling is actually unnecessary in the presence of good reference locality and merely incurs additional migration overheads without any benefit. Instead, blocks should be rearranged only when they are accessed together but lack spatial locality, i.e., those that cause long seeks and long rotational delays between accesses. This can significantly reduce the number of replications needed to effectively modify the disk layout.

• When blocks are moved from their original locations, they may lose useful data sequentiality. Instead, blocks should be replicated so as to improve both random and sequential accesses. Improving performance for one workload at the expense of another is something we would like to avoid. However, replication consumes additional disk capacity and introduces complexities in maintaining consistency between original disk blocks and their replicas. These issues will be addressed in the next section.

• Using only a narrow region in the middle of the disk to rearrange the disk layout is not always practical if the rearrangement is to be done online, because interference with foreground tasks has to be considered. Interference can be significant when foreground tasks are accessing disk regions that are far away from the middle region. We will demonstrate ways to minimize this interference by using the free disk space near the current disk head position.

• Some previous techniques reserve a fixed amount of storage capacity just for rearranging the disk layout. This is intrusive, as users can no longer utilize the full capacity of their disk. By contrast, using only free disk space is completely transparent to users, and instead of being wasted, free disk space can now be utilized to our benefit. To hide from users the fact that some of their disk capacity is being used to hold replicas, we need to invalidate and free replicated blocks quickly when disk capacity becomes limited. As the amount of free disk space can vary, this technique provides a variable quality of service, where the degree of performance improvement we can achieve depends on the amount of free disk space that is available. However, empirical evidence shows that plenty of free disk space can be found in today's computer systems, so this is rarely a problem.

• Most previous work is either purely theoretical or based on trace analysis. We implemented a real system and evaluated it using both server-type workloads and common day-to-day single-user workloads, demonstrating significant benefits of using FS2.
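The first point above argues for replicating only blocks that are temporally related (accessed together) yet spatially distant. A minimal sketch of that selection rule follows; the trace values, threshold, and function name are illustrative assumptions, not from the paper:

```python
# Sketch: pick replication candidates from a block-access trace.
# Candidates are blocks that are temporally related (adjacent in the
# trace) but spatially distant (a long seek away from the previous
# access). The threshold is a hypothetical cutoff.

SEEK_THRESHOLD = 100_000  # sectors; illustrative "long seek" cutoff

def replication_candidates(trace, threshold=SEEK_THRESHOLD):
    """Return blocks that force long seeks from their predecessor
    in the access trace."""
    candidates = set()
    for prev, cur in zip(trace, trace[1:]):
        if abs(cur - prev) > threshold:
            candidates.add(cur)  # cur is far from where the head just was
    return candidates

# Interleaved accesses between two distant regions (cf. the CVS example):
trace = [8000, 8016, 900_000, 8032, 900_008, 8048]
print(sorted(replication_candidates(trace)))
```

A real implementation would also weed out one of the two regions (the hot one), replicating only blocks from the other; this sketch merely flags every long-seek target.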

Reducing disk head positioning latencies can make a significant impact on disk performance. This is illustrated in Figure 2, where the disk access time of 4 different disk drives is broken down into its components: transfer time, seek time, and rotational delay. This figure shows that transfer time—the only time during which the disk is doing useful work—is dwarfed by seek time and rotational delay. In a random disk workload (as shown in this figure), seek time is by far the most dominant component of a disk access. However, rotational delay becomes the dominant component when the disk access locality reaches 70% (Lumb et al. [28] studied this in detail). In addition to improving performance, a reduction in seek time (distance) also saves energy, as the disk dissipates a substantial amount of power when seeking (i.e., when accelerating and decelerating the disk heads at 30–40g).
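As a rough sanity check on the rotational-delay component, note that the expected rotational delay is the time of half a revolution, so it follows directly from the spindle speed. A back-of-envelope calculation (generic RPM values, not the paper's measured drives):

```python
# Average rotational delay is half a revolution:
#   avg_rot_ms = (60 / RPM) / 2 * 1000
def avg_rotational_delay_ms(rpm):
    """Expected rotational delay in milliseconds for a given spindle speed."""
    return 60.0 / rpm / 2.0 * 1000.0

# A 15000-RPM drive waits 2 ms on average per access; a 5400-RPM
# laptop drive waits more than twice as long.
for rpm in (15000, 10000, 7200, 5400):
    print(rpm, round(avg_rotational_delay_ms(rpm), 2))
```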

The rest of the paper is organized as follows. Section 2 gives an overview of our system design. Section 3 discusses implementation details of our prototype based on the Ext2 file system. Section 4 presents and analyzes experimental results of FS2 and compares it with Ext2. Related work is discussed in Section 5. Future research directions are discussed in Section 6, and the paper concludes with Section 7.

Figure 2: Breakdown of the disk access time (seek time, transfer time, and rotational delay, each as a percentage of the total) in 4 drives (Hitachi Ultrastar, WD Raptor, WD Caviar, and WD Scorpio), ranging from a top-of-the-line 15000-RPM SCSI drive to a slow 5400-RPM IDE laptop drive.

2. DESIGN

This section details the design of FS2 on a single-disk system. Most techniques used in building FS2 on a single-disk system can be directly applied to a disk array (e.g., RAID), yielding a performance improvement similar to what can be achieved on a single disk (in addition to the performance improvements achievable by using RAID). More on this will be discussed in Section 6.

FS2 differs from other dynamic disk layout techniques mainly in its exploitation of free disk space, which has a profound impact on our design and implementation decisions. In the next section, we first characterize the availability of free disk space in a large university computing environment and then describe how free disk space can be exploited. Details on how to choose candidate blocks to replicate, and how to use free disk space to facilitate the replication of these blocks, are given in the following section. Then, we describe the mechanisms used to keep track of replicas both in memory and on disk, so that we can quickly find all alternative locations where data is stored and choose the "nearest" one that can be accessed fastest.

2.1 Availability of Free Disk Space

Due to the rapid increase in disk recording density, much larger disks can be built with fewer platters, reducing their manufacturing cost significantly. With bigger and cheaper disks widely available, we are much less constrained by disk capacity, and hence, are much more likely to leave a significant portion of our disk unused. This is shown by our study of the 242 disks installed on the 22 public servers maintained by the University of Michigan's EECS Department in April 2004. The amount of unused capacity that we found on these disks is shown in Figure 3. As expected, most disks had a substantial amount of unused space—60% of the disks had more than 30% unused capacity, and almost all the disks had at least 10% unused capacity. Furthermore, as file systems may reserve additional disk space for emergency/administrative use, which we did not account for in this study, our observation gives only a lower bound on how much free disk space there really is.

Our observation is consistent with the study done by Douceur et al. [9], who reported an average use of only 53% of disk space among 4801 Windows personal computers in a commercial environment. This is a significant waste of resources because, when disk space is not being used, it usually contributes nothing to users/applications. So, we argue that free disk space can be used to dynamically change the disk layout so as to increase disk I/O performance.

Figure 3: This figure shows the amount of free disk space on 242 disks from 22 public server machines (a histogram and cumulative distribution over the percentage of each disk that is full, in 5% bins).

Free disk space is commonly maintained by file systems as a large pool of contiguous blocks to facilitate block allocation requests. This makes the use of free disk space attractive for several reasons. First, the availability of these large contiguous regions allows discontiguous disk blocks that are frequently accessed together to be replicated and aggregated efficiently using large sequential writes. By preserving the replicas' adjacency on disk according to the order in which their originals were accessed at runtime, future accesses to these blocks will be made significantly faster. Second, contiguous regions of free disk blocks can usually be found throughout the entire disk, allowing us to place replicas at the location closest to the current disk head's position. This minimizes head positioning latencies when replicating data, and only minimally affects the performance of foreground tasks. Third, using free disk space to hold replicas allows them to be quickly invalidated and their space converted back to free disk space whenever disk capacity becomes tight. Therefore, users can be completely oblivious to how the "free" disk capacity is used, and will only notice performance improvements when more free disk space becomes available.

However, the use of free disk space to hold replicas is not without costs. First, as multiple copies of data may be kept, we have to keep data and their replicas consistent. Synchronization can severely degrade performance if, for each write operation, we have to perform additional writes (possibly multiple) to keep all its replicas consistent. Second, when the user demands more disk space, replicas should be quickly invalidated and their occupied disk space returned to the file system so as to minimize user-perceived delays. However, as this would undo some of the work done previously, we should try to avoid it if possible. Moreover, as the amount of free disk space can vary over time, deciding how to make use of the available free disk space is also important, especially when there is only a small amount of free space left. These issues will be addressed in the following subsections.

Figure 4: Files are accessed in the order of 0, 1, and 2. Due to the long inter-file distance between File 1 and the other files, long seeks would result if accesses to these files are interleaved. In the bottom figure, we show how the seek distance (time) could be reduced when File 1 is replicated using the free disk space near the other two files.

2.2 Block Replication

We now discuss how to choose which blocks to replicate so as to maximize the improvement in disk I/O performance. We start with an example, shown in Figure 4, describing a simple scenario where three files—0, 1, and 2—are accessed, in that order. One can imagine that Files 0 and 2 are the binary executables of an application and File 1 is a shared library, which may therefore be placed far away from the other two files, as shown in Figure 4. One can easily observe that File 1 is placed poorly if it is frequently accessed together with the other two files. If the files are accessed sequentially, where each file is accessed completely before the next, the overhead of seeking is minimal, as it would result in only two long seeks—one from 0 to 1 and another from 1 to 2. However, if the accesses to the files are interleaved, performance will degrade severely, as many more long-distance seeks would result. To improve performance, one can reduce the seek distance between the files by replicating File 1 closer to the other two, as shown in the bottom part of Figure 4. Replicating File 1 also reduces rotational delay when its meta-data and data are placed onto the disk consecutively, according to the order in which they were accessed. This is illustrated in Figure 5. Reducing rotational delay in some workloads (i.e., those already with a high degree of disk access locality) can be more effective than reducing seek time in improving disk performance.
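The seek-distance argument of Figure 4 can be made concrete with a toy model. The sector positions and access pattern below are arbitrary illustrative values, not the paper's data:

```python
# Toy model of Figure 4: three files, with File 1 far from the others.
# Compare total seek distance for an interleaved access pattern before
# and after File 1 is replicated into free space near Files 0 and 2.

def total_seek(positions):
    """Sum of absolute seek distances along an access sequence."""
    return sum(abs(b - a) for a, b in zip(positions, positions[1:]))

FILE0, FILE2, FILE1_ORIG = 1_000, 2_000, 500_000
FILE1_REPLICA = 3_000  # hypothetical replica placed near Files 0 and 2

interleaved = [FILE0, FILE1_ORIG, FILE2, FILE1_ORIG, FILE0, FILE1_ORIG]
with_replica = [FILE1_REPLICA if p == FILE1_ORIG else p for p in interleaved]

print(total_seek(interleaved), total_seek(with_replica))
```

Every access to File 1 that previously crossed the long gap now stays within the small region, so the total seek distance collapses by orders of magnitude in this model.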

Figure 5: This figure shows a more detailed disk layout of File 1 shown in Figure 4. As file systems tend to place data and their meta-data (and also related data blocks) close to one another, intra-file seek distance (time) is usually short. However, as these disk blocks do not necessarily have to be consecutive, rotational delay becomes a more important factor. In the lower portion of the figure, we show how rotational delay can be reduced by replicating File 1 and placing its meta-data and data in the order that they were accessed at runtime.

In real systems, the disk access pattern is more complicated than that shown in the above examples. The examples are mainly to provide an intuition for how seek time and rotational delay can be reduced via replication. We present more practical heuristics used in our implementation in Section 3.

2.3 Keeping Track of Replicas

To benefit from the replication of disk blocks, we must be able to find them quickly and decide whether it is more beneficial to access these replicas or their originals. This is accomplished by keeping a hash table in memory, as shown in Figure 6. It occupies only a small amount of memory: assuming 4KB file system data blocks, a 4MB hash table suffices to keep track of 1GB of replicas on disk. As most of today's computer systems have hundreds of megabytes of memory or more, it is usually not a problem to keep all the hash entries in memory. For each hash entry, we maintain 4 pieces of information: the location of the original block, the location of its replica, the time when the replica was last accessed, and the number of times the replica was accessed. When free disk space drops below a low watermark (at 10%), the last two items are used to find and invalidate replicas that have not been frequently accessed and were not recently used. At this point, we suspend the replication process; it is resumed only when a high watermark (at 15%) is reached. We also define a critical watermark (at 5%), which triggers a large number of replicas to be invalidated in a batched fashion so that a contiguous range of disk blocks can be quickly released to users. This is done by grouping hash entries of replicas that are close to one another on disk so that they are also close to one another in memory, as shown in Figure 6. When the number of free disk blocks drops below the critical watermark, we simply invalidate all the replicas of an entire disk region, one region at a time, until the number of free disk blocks is above the low watermark or there are no more replicas. During this process, we could potentially invalidate more valuable replicas and keep less valuable ones. This is acceptable since our primary concern under such a condition is to provide users with the needed disk capacity as quickly as possible.
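The bookkeeping described above can be sketched as follows. The watermark thresholds (5/10/15%) and the per-entry fields come from the text; the function names and the use of a plain dictionary in place of the kernel's hash table are simplifying assumptions:

```python
# Sketch of replica bookkeeping: one entry per replicated block
# (replica location, last-access time stamp, reference count), plus
# the watermark policy that suspends or resumes replication.

REPLICATE, SUSPEND, BATCH_INVALIDATE = "replicate", "suspend", "batch-invalidate"

def replication_mode(free_fraction, currently_suspended):
    """Decide what the replicator should do given the free-space fraction."""
    if free_fraction < 0.05:                  # critical watermark: free space fast
        return BATCH_INVALIDATE
    if free_fraction < 0.10:                  # low watermark: stop replicating
        return SUSPEND
    if currently_suspended and free_fraction < 0.15:
        return SUSPEND                        # resume only at the high watermark
    return REPLICATE

replicas = {}  # original block # -> (replica block #, last-access TS, ref count)

def record_replica(orig, dup, now):
    replicas[orig] = (dup, now, 0)

def lookup(orig, now):
    """Return the replica's location (updating TS and Ref Cnt), or None."""
    if orig not in replicas:
        return None
    dup, _, cnt = replicas[orig]
    replicas[orig] = (dup, now, cnt + 1)
    return dup
```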

Given that we can find replicas quickly using the hash table, deciding whether to use an original disk block or one of its replicas is as simple as finding the one closest to the current disk head location. The location of the disk head can be predicted by keeping track of the last disk block that was accessed in the block device driver.
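The nearest-copy decision can be sketched in a few lines, using seek distance from the predicted head position (the last block accessed) as a stand-in for real positioning time; the function name is hypothetical:

```python
# Sketch: choose between an original block and its replica by distance
# from the predicted disk head position (the last block accessed).

def nearest_copy(last_block, original, replica=None):
    """Return whichever copy is closer to the predicted head position."""
    if replica is None:
        return original  # no replica exists: must use the original
    return min((original, replica), key=lambda b: abs(b - last_block))
```

A real driver would use a positioning-time model rather than raw block distance, since rotational position also matters.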

As a result of replication, there may be multiple copies of a data block placed at different locations on disk. Therefore, if the data block is modified, we need to ensure that all of its replicas remain consistent. This can be accomplished by either updating or invalidating the replicas. In the former (updating) option, modifying multiple disk blocks when a single block is modified can be expensive: not only does it generate more disk traffic, but it also incurs synchronization overheads. Therefore, we chose the latter option, because both block invalidation and block replication can be done cheaply and in the background.

Figure 6: A detailed view of the hash data structure used to keep track of replicas. Each hash entry stores the original block number (Block #), the duplicated block number (Dup #), the time stamp of the last access (TS), and a reference count (Ref Cnt). Hashing is used to speed up the lookup of replicas. Replicas that are close to one another on disk also have their hash entries close to one another in memory, so a contiguous range of replicas can be quickly invalidated and released to users when needed.

3. PROTOTYPE

Our implementation of FS2 is based on the Ext2 file system [4]. In our implementation, the functionality of Ext2 is minimally altered: it is modified mainly to keep track of replicas and to maintain data consistency. The low-level tasks of monitoring disk accesses, scheduling I/Os between original and replicated blocks, and replicating data blocks are delegated to the lower-level block device driver. Implementation details of the block device driver and the file system are discussed in Section 3.1 and Section 3.2, respectively. Section 3.3 describes some of the user-level tools we developed to help users manage the FS2 file system.

3.1 Block Device Implementation

Before disk requests are sent to the block device driver to be serviced, they are first placed onto an I/O scheduler queue, where they are queued, merged, and rearranged so that the disk may be better utilized. Currently, the Linux 2.6 kernel supports multiple I/O schedulers that can be selected at boot time. It is still highly debatable which I/O scheduler performs best, but for workloads with highly parallel disk I/Os, the anticipatory scheduler [23] is clearly the winner. Therefore, it is used in both the base Ext2 file system and our FS2 file system to make a fair comparison.

In our implementation, the anticipatory I/O scheduler is slightly modified so that a replica can be accessed in place of its original if the replica can be accessed faster. A disk request is represented by a starting sector number, the size of the request, and the type of the request (read or write). For each read request (write requests are treated differently) that the I/O scheduler passes to the block device driver, we examine each of the disk blocks within the request. There are several cases to consider. In the simplest case, where none of the blocks have replicas, we have no choice but to access the original blocks. In the case where all disk blocks within a request have replicas, we access the replicas instead of their originals when (i) the replicas are physically contiguous on disk, and (ii) they are closer to the current disk head's location than their originals. If both criteria are met, accessing the replicas instead of their originals requires modifying only the starting block number of the disk request. If only a subset of the blocks has replicas, the replicas should not be used, as doing so would break up a single disk request into multiple ones and, in most cases, would take more time. For write accesses, replicas of modified blocks are simply invalidated, as explained in Section 2.3.
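The per-request decision above can be sketched as follows. The request descriptor, the toy replica map, and the distance metric are simplifications we introduce for illustration; the real implementation operates inside the anticipatory scheduler on the kernel's own request structures.

```c
/* Illustrative request descriptor; not the kernel's structure. */
struct disk_req {
    unsigned long start;    /* starting block number  */
    unsigned int  nblocks;  /* request size in blocks */
};

/* Toy replica map for demonstration: blocks 1000..1003 are
 * replicated contiguously at 5000..5003. A real implementation
 * consults the hash table of Figure 6. Returns 0 when a block
 * has no replica. */
static unsigned long replica_of(unsigned long blk)
{
    if (blk >= 1000 && blk < 1004)
        return 5000 + (blk - 1000);
    return 0;
}

/* Distance between two block addresses, used as a stand-in for
 * "closer to the current disk head's location". */
static unsigned long dist(unsigned long a, unsigned long b)
{
    return a > b ? a - b : b - a;
}

/* Redirect a read to its replicas when (i) every block in the
 * request has a replica, (ii) the replicas are physically
 * contiguous, and (iii) they are closer to the disk head than the
 * originals. Only the starting block number changes; a partial
 * set of replicas is never used, as that would split the request.
 * Returns 1 if the request was redirected. */
int maybe_redirect(struct disk_req *req, unsigned long head_pos)
{
    unsigned long first = replica_of(req->start);
    unsigned int i;

    if (first == 0)
        return 0;                  /* no replica: use the original */

    for (i = 1; i < req->nblocks; i++)
        if (replica_of(req->start + i) != first + i)
            return 0;              /* missing or non-contiguous */

    if (dist(first, head_pos) < dist(req->start, head_pos)) {
        req->start = first;        /* read the replicas instead */
        return 1;
    }
    return 0;
}
```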

Aside from scheduling I/Os between original and replicated blocks, the block device driver is also responsible for monitoring disk accesses so it can decide which blocks should be replicated and where the replicas should be placed on disk. We first show how to decide which blocks should be replicated. As mentioned earlier, good candidates are not necessarily frequently-accessed disk blocks, but rather temporally-related blocks that lack good spatial locality. However, it is unnecessary to replicate all the candidate blocks. This was illustrated previously by the example shown in Figure 4: instead of replicating all three files, it is sufficient to replicate only File 1; we would like to do a minimal amount of work to reduce mechanical movements. To do so, we first find a hot region on disk. A hot region is defined as a small area on disk where the disk head has been most active, and it can change from time to time as the workload changes. To minimize disk head positioning latencies, we would like to keep the disk head from moving outside of the hot region. This is accomplished by replicating outside disk blocks that are frequently accessed together with blocks within the hot region and placing them there, if possible. This allows similar disk access patterns to be serviced much faster in the future; we achieve shorter seek times due to shorter seek distances, and smaller rotational delays because replicas are laid out on disk in the same order that they were previously accessed. In our implementation, we divide a disk into multiple regions,1 and for each we count the number of times the blocks within the region have been accessed. From the reference count of each disk region, we can approximately locate where the disk head has been most active. This is only an approximation because (i) some disk blocks can be accessed directly from/to the on-disk cache, thus not involving any mechanical movements, and

1 For convenience, we simply define a region to be a multiple of Ext2 block groups.

(ii) some read accesses can trigger two seek operations if they cause a cache-entry conflict and the conflicting cache entry is dirty. However, in the disk traces collected from running various workloads (described in Section 4), we found this heuristic to be sufficient for our purpose.
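The per-region reference counting described above can be sketched as follows. The region count and region size are assumptions (FS2 defines a region as a multiple of Ext2 block groups); the function names are ours.

```c
/* Sketch of hot-region detection: the disk is divided into regions
 * and the block device driver counts accesses per region; the
 * most-referenced region approximates where the head has been most
 * active. Constants here are illustrative. */
#define NREGIONS          64        /* assumed region count */
#define BLOCKS_PER_REGION 32768UL   /* assumed region size  */

static unsigned long region_refs[NREGIONS];

static unsigned int region_of(unsigned long blk)
{
    return (unsigned int)((blk / BLOCKS_PER_REGION) % NREGIONS);
}

/* Called on every serviced request. */
void count_access(unsigned long blk)
{
    region_refs[region_of(blk)]++;
}

/* Return the index of the current hot region. This is only an
 * approximation: on-disk cache hits and conflict-eviction seeks
 * are counted as if they moved the head. */
unsigned int hot_region(void)
{
    unsigned int i, best = 0;
    for (i = 1; i < NREGIONS; i++)
        if (region_refs[i] > region_refs[best])
            best = i;
    return best;
}
```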

Although the block device driver can easily locate the hot region on disk and find candidate blocks to replicate, it cannot perform replication by itself because it can determine neither how much free disk space there is, nor which blocks are free and which are in use. Implicit detection of data liveness at the block device level was shown by Sivathanu et al. [13] to be impossible in the Ext2 file system. Therefore, replication requests are forwarded from the block device driver to the file system, where free disk space to hold replicas can be easily located. Contiguous free disk space is used if possible, so that the replicated blocks can be written sequentially to disk. Furthermore, since disk blocks can be replicated more efficiently to regions with more free space, we place more weight on these regions when deciding where to place replicas.
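One way to combine the two placement criteria above is a simple score that grows with a region's free space and shrinks with its distance from the hot region. The paper states only that regions with more free space are given more weight, so the scoring function below is our own assumption, shown purely for illustration.

```c
#define NREGIONS 64   /* assumed region count */

static unsigned int rdist(unsigned int a, unsigned int b)
{
    return a > b ? a - b : b - a;
}

/* free_blocks[i]: free blocks in region i, as reported by the file
 * system's block-group bitmaps. 'hot' is the current hot region and
 * 'needed' is the number of blocks to replicate. Regions that cannot
 * hold all the replicas are skipped so the replicas can be written
 * sequentially; among the rest, more free space raises the score and
 * distance from the hot region lowers it. */
unsigned int pick_target_region(const unsigned long free_blocks[NREGIONS],
                                unsigned int hot, unsigned long needed)
{
    unsigned int i, best = hot;    /* fall back to the hot region */
    unsigned long best_score = 0;

    for (i = 0; i < NREGIONS; i++) {
        if (free_blocks[i] < needed)
            continue;              /* cannot hold the replicas */
        unsigned long score = free_blocks[i] / (1 + rdist(i, hot));
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}
```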

3.2 File System Implementation

Ext2 is a direct descendant of BSD UNIX FFS [12], and is commonly chosen as the default file system by many Linux distributions. Ext2 splits a disk into one or more block groups, each of which contains its own set of inodes (meta-data) and data blocks to enhance data locality and fault tolerance. More recently, journaling capability was added to Ext2 to produce the Ext3 file system [43], which is backward-compatible with Ext2. We could have easily based our implementation on other file systems, but we chose Ext2 because it is the most popular file system in today's Linux systems. As with Ext3, FS2 is backward-compatible with Ext2.

First, we modified Ext2 so it can assist the block device driver in finding free disk space near the current disk head's location to place replicas. As the free disk space of each Ext2 block group is maintained by a bitmap, large regions of free disk space can be easily found. We also modified Ext2 to assist the block device driver in maintaining data consistency. In Section 3.1, we discussed how data consistency is maintained for write accesses at the block device level. However, not all data consistency issues can be observed and addressed at the block device level. For example, when a disk block with one or more replicas is deleted or truncated from a file, all of its replicas should also be invalidated and released. As the block device driver is completely unaware of block deallocations, we modified the file system to explicitly inform the block device driver of such events so it can invalidate these replicas and remove them from the hash table.

To allow replicas to persist across restarts, we flush the in-memory hash table to disk when the system shuts down. When the system starts up again, the hash table is read back into memory. We keep the hash table as a regular Ext2 file using reserved inode #9, so it cannot be accessed by regular users. To minimize the possibility of data inconsistency resulting from a system crash, we flush any modified hash entries to disk periodically. We developed a set of user-level tools (discussed in Section 3.3) to restore the FS2 file system to a consistent state when data consistency is compromised after a crash. An alternative solution is to keep the hash table in flash memory. Flash memory provides a low-latency persistent storage medium, so the hash table can be kept consistent even when the system unexpectedly crashes. Given that the size of today's flash memory devices usually ranges from a few hundred megabytes to tens of gigabytes, the hash table can be easily stored.

Ext2 is also modified to monitor the amount of free disk space to prevent replication from interfering with the user's normal file system operations. We defined a high (15%), a low (10%), and a critical (5%) watermark to monitor the amount of unused disk space and to respond with appropriate actions when a watermark is reached. For instance, when the amount of free disk space drops below the low watermark, we start reclaiming disk space by invalidating replicas that have been used neither recently nor frequently. If the amount of free disk space drops below the critical watermark, we start batching invalidations for entire disk regions, as described in Section 2.3. Moreover, replication is suspended when the amount of free disk space drops below the low watermark, and it is resumed only when the high watermark is reached again.
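The three-watermark policy above, including the hysteresis between the low and high watermarks, can be sketched as a small state machine. The enum and function names are illustrative, not FS2's actual identifiers.

```c
/* Actions taken by the free-space monitor. */
enum fs2_action {
    FS2_REPLICATE,        /* normal operation: replication allowed    */
    FS2_RECLAIM,          /* invalidate cold replicas, no replication */
    FS2_RECLAIM_REGIONS   /* batch-invalidate whole regions           */
};

#define HIGH_WM 15   /* percent free: resume replication      */
#define LOW_WM  10   /* percent free: reclaim, suspend        */
#define CRIT_WM  5   /* percent free: batch invalidation      */

/* 'replicating' carries the hysteresis: once suspended below the
 * low watermark, replication resumes only after free space climbs
 * back above the high watermark. */
enum fs2_action fs2_policy(unsigned int pct_free, int *replicating)
{
    if (pct_free < CRIT_WM) {
        *replicating = 0;
        return FS2_RECLAIM_REGIONS;
    }
    if (pct_free < LOW_WM) {
        *replicating = 0;
        return FS2_RECLAIM;
    }
    if (pct_free >= HIGH_WM)
        *replicating = 1;
    return *replicating ? FS2_REPLICATE : FS2_RECLAIM;
}
```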

3.3 User-level Tools

We created a set of user-level tools to help system administrators manage the file system. mkfs2 is used to convert an existing Ext2 file system to an FS2 file system, and vice versa. Normally, this operation takes less than 10 seconds. Once an FS2 file system is created and mounted, it automatically starts monitoring disk traffic and replicating data on its own. However, users with higher-level knowledge of workload characteristics can also help make replication decisions by asking mkfs2 to statically replicate a set of disk blocks to a specified location on disk. This gives users a knob to manually tune how free disk space should be used.

As mentioned earlier, a file system may crash, resulting in data inconsistency. During system startup, the tool chkfs2 restores the file system to a consistent state if it detects that the file system was not cleanly unmounted, similar to what Ext2's e2fsck tool does. The time to restore an FS2 file system to a consistent state is only slightly longer than that of Ext2, and depends on the number of replicas in the file system.

4. EXPERIMENTAL EVALUATION

We now evaluate FS2 under some realistic workloads. By dynamically modifying the disk layout according to the disk access pattern observed at runtime, seek time and rotational delay are shown to be reduced significantly, which translates into a substantial performance improvement and energy savings. Section 4.1 describes our experimental setup, and Section 4.2 details our evaluation methodology. Experimental results for each workload are discussed in the subsequent subsections.

4.1 Experimental Setup

Our testbed is set up as shown in Figure 7. The IDE device driver on the testing machine is instrumented so that all of its disk activities are sent via netconsole to the monitoring machine, where the activities are recorded. We used separate network cards for running workloads and for collecting disk traces to minimize interference. The system configuration of the testing machine and a detailed specification of its disk are provided in Table 1.

Several stress tests were performed to ensure that system performance is not significantly affected by using netconsole to collect disk traces. We found that performance was degraded by 5% in the worst case. For common workloads, performance is usually minimally affected. For example, the time to compile the Linux 2.6.7 kernel source code was not impacted at all by recording disk activities via netconsole. Therefore, we believe our tracing mechanism had only a negligible effect on the experimental results.

Figure 7: The testbed consists of a testing machine for running workloads and a monitoring machine for collecting disk traces. Each machine is equipped with two Ethernet cards: one for running workloads and one for collecting disk traces using netconsole.

4.2 Evaluation Methodology

For each request in a disk trace, we recorded its start and end times, the starting sector number, the number of sectors requested, and whether it is a read or a write access. The interval between the start and end times is the disk access time, denoted by TA, which is defined as:

TA = Ts + Tr + Tt,

where Ts is the seek time, Tr is the rotational delay, and Tt is the transfer time. Decomposing TA is necessary for us to understand the implications of FS2 on each of these components. If the physical geometry of the disk is known (e.g., number of platters, number of cylinders, number of zones, sectors per track, etc.), this can be easily accomplished with the information that was logged. Unfortunately, due to increasingly complex hard drive designs and fiercer market competition, disk manufacturers no longer disclose such information to the public. They present only a logical view of the disk, which, unfortunately, bears no resemblance to the actual physical geometry; it provides just enough information to allow BIOS and drivers to function correctly. As a result, it is difficult to decompose a disk access into its different components. Various tools [37, 42, 14, 1] have been implemented to extract the physical disk geometry experimentally, which usually takes a very long time. Using the same empirical approach, we show that a disk can be characterized in seconds, allowing us to accurately decompose TA into its various components.

Components                      Specification
CPU                             Athlon 1.343 GHz
Memory                          512 MB DDR2700
Network cards                   2x 3COM 100-Mbps
Disk                            Western Digital
  Bus interface                 IDE
  Capacity                      78.15 GB
  Rotational speed              7200 RPM
  Average seek time             9.9 ms
  Track-to-track seek time      2.0 ms
  Full-stroke seek time         21 ms
  Average rotational delay      4.16 ms
  Average startup power         17.0 W
  Average read/write/idle power 8.0 W
  Average seek power            14.0 W

Table 1: System specification of the testing machine.

Figure 8: This figure shows logical seek distance versus access time for a Western Digital SE 80GB 7200 RPM drive. Each point in the graph represents a recorded disk access.

Using the same experimental setup as described previously, a random disk trace is recorded from the testing machine. From this trace, we observed a clear relationship between logical seek distance (i.e., the distance between consecutive disk requests in units of sectors) and TA, which is shown in Figure 8. The vertical band in this figure is an artifact of the rotational delay's variability, which, for a 7200 RPM disk, can vary between 0 and 8.33 msec (exactly the height of this band). A closer examination of this figure reveals an irregular bump in the disk access time when the seek distance is small. This bump appears to reflect the fact that the seek distance is measured in number of sectors, but the number of sectors per track can vary significantly from the innermost cylinder to the outermost cylinder. Thus, the same logical distance can translate to different physical distances, and this effect is more pronounced when the logical seek distance is small.

As TA is composed of Ts, Tr, and Tt, by removing the variation caused by Tr, we are left with the sum of Ts and Tt, which is represented by the lower envelope of the band shown in Figure 8. As Tt is negligible compared to Ts for a random workload (shown previously in Figure 2), the lower envelope essentially represents the seek profile curve of this disk. This is similar to the traditional seek profile curve, in which the seek distance is measured in the number of cylinders (physical distance) as opposed to the number of sectors (logical distance). As shown in Figure 8, the seek time is usually linearly related to the seek distance, but for very short distances it is better approximated by a third-order polynomial. We observed that for a seek distance of x sectors, Ts (in units of msec) can be expressed as:

Ts = 8E-21 x^3 − 2E-13 x^2 + 1E-6 x + 1.35,   if x < 1.1 × 10^7
Ts = 9E-8 x + 4.16,                            otherwise.

As the seek distance can be calculated by taking the difference between the starting sector numbers of consecutive disk requests, the seek time Ts can be easily computed from the above equation. With knowledge of Ts, we only have to find either Tr or Tt to completely decompose TA. We found it easier to derive Tt than Tr. To derive Tt, we first measured the effective I/O bandwidth for each of the bit recording zones on the disk using large sequential I/Os (so as to eliminate seek and rotational delays). For example, on the outermost zone, the disk's effective bandwidth was measured to be 54 MB/sec; with each sector being 512 bytes long, it would take 9.0 µsec to transfer a single sector from this zone. From a disk trace, we know exactly how many sectors are accessed in each disk request, and therefore the transfer time of each can be easily calculated by multiplying the number of sectors in the request by the per-sector transfer time. Once we have both Ts and Tt, we can trivially derive Tr. Due to disk caching, some disk accesses will have an access time below the seek profile curve, in which case, since there is no mechanical movement, we treat all of their access time as transfer time.
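The decomposition procedure above can be sketched directly from the fitted seek equation. The coefficients are those given in the text; the 9.0 µs/sector transfer cost is the paper's example for the outermost zone (54 MB/sec, 512-byte sectors), and the function names are ours.

```c
/* Seek time (msec) from logical seek distance x (sectors), using
 * the piecewise fit from the text. */
double seek_time_ms(double x)
{
    if (x < 1.1e7)
        return 8e-21*x*x*x - 2e-13*x*x + 1e-6*x + 1.35;
    return 9e-8*x + 4.16;
}

/* Decompose a measured access time TA (ms) into seek, transfer,
 * and rotational components. 'xfer_us_per_sector' comes from the
 * measured zone bandwidth (e.g., 9.0 us/sector on the outermost
 * zone). Accesses faster than the seek profile predicts are cache
 * hits: with no mechanical movement, all of TA is treated as
 * transfer time. */
void decompose(double ta_ms, double seek_sectors,
               unsigned long nsectors, double xfer_us_per_sector,
               double *ts, double *tt, double *tr)
{
    *ts = seek_time_ms(seek_sectors);
    *tt = nsectors * xfer_us_per_sector / 1000.0;
    if (ta_ms < *ts + *tt) {          /* below the seek profile curve */
        *ts = 0.0;
        *tr = 0.0;
        *tt = ta_ms;
        return;
    }
    *tr = ta_ms - *ts - *tt;          /* TA = Ts + Tr + Tt */
}
```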

The energy consumed to service a disk request, denoted by EA, can be calculated from the values of Ts, Tr, and Tt as:

EA = Ts × Ps + (Tr + Tt) × Pi,

where Ps is the average seek power and Pi is the average idle/read/write power (shown in Table 1). The total energy consumed by a disk can be calculated by adding the sum of all EA's to the idle energy consumed by the disk.
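The per-access energy model is a one-liner with the Table 1 power figures plugged in; the constant and function names below are our own.

```c
/* EA = Ts*Ps + (Tr + Tt)*Pi, with Table 1's power figures:
 * Ps = 14.0 W (seek), Pi = 8.0 W (read/write/idle).
 * Times in ms, result in mJ (W x ms = mJ). */
#define PS_W 14.0
#define PI_W  8.0

double access_energy_mj(double ts_ms, double tr_ms, double tt_ms)
{
    return ts_ms * PS_W + (tr_ms + tt_ms) * PI_W;
}
```

As a sanity check, the Ext2 TPC-W averages reported in Section 4.3.1 (Ts = 3.8 ms seek, Tr = 4.3 ms rotation) give about 53 mJ for seek and 34 mJ for rotation, roughly consistent with the 90 mJ per access (54 mJ seek, 36 mJ rotation) reported there once transfer time is included.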

4.3 The TPC-W Benchmark

The TPC-W benchmark [5], specified by the Transaction Processing Council (TPC), is a transactional web benchmark that simulates an E-commerce environment. Customers browse and purchase products from a web site composed of multiple servers, including an application server, a web server, and a database server. In our experiment, we ran a MySQL database server and an Apache web server on a single machine. On a separate client machine, we simulated 25 simultaneous customers for 20 minutes. Complying with the TPC-W specification, we assume a mean session time of 15 minutes and a mean think time of 7 seconds for each client. Next, we compare FS2 with Ext2 with respect to performance and energy consumption.

4.3.1 Ext2: Base System

Figure 9(a1) shows a disk trace collected from running the TPC-W benchmark on an Ext2 file system. Almost all disk accesses are concentrated in several distinct regions of the disk, containing a mixture of the database's data and index files and the web server's image files. The database files can be large, so a single file can span multiple block groups. Because Ext2 uses a quadratic hash function to choose the next block group from which to allocate disk blocks when the current block group is full, a single database file is often split into multiple segments, each stored at a different location on the disk. When the file is later accessed, the disk head may have to travel frequently between multiple regions. This is illustrated in Figure 9(a2), which shows a substantial number of disk accesses with long seek times.

We observed that the average response time perceived by clients is 750 msec when the benchmark is run on an Ext2 file system. The average disk access time is found to be 8.2 msec, of which 3.8 msec is due to seek and 4.3 msec is due to rotation. The average energy consumed by each disk access is calculated to be 90 mJ, of which 54 mJ is due to seek and 36 mJ is due to rotation. These results are plotted in Figure 10 and summarized in Table 2.

Figure 9: Plots (a1–c1) show the disk sectors that were accessed for the duration of a TPC-W run. Plots (a2–c2) show the measured access times for each request with respect to the seek distance for the TPC-W benchmark. Plots a, b, and c correspond to the results collected on the Ext2, FS2-static, and FS2-dynamic file systems, respectively.

Figure 10: For the TPC-W benchmark, part (a) shows the average response time of each transaction perceived by clients. Part (b) shows the average disk access time and its breakdown. Part (c) shows the average energy consumed by each disk access and its breakdown.

4.3.2 FS2-Static

From the TPC-W disk trace shown previously, an obvious method to modify the disk layout would be to group all the database and web server files closer to one another on the disk so as to minimize seek distance. To do so, we first find all the disk blocks that belong to either the database or the web server, and then statically replicate these blocks (using the mkfs2 tool) toward the middle of the disk.2 After the disk layout is modified, we ran the TPC-W workload again; the resulting disk trace and access time distribution are plotted in Figures 9(b1) and (b2), respectively. One can see from these figures that, as replicas can now be accessed in addition to their originals, most disk accesses are concentrated within a single region. This significantly reduces the range of motion of the disk head for most disk accesses.

              Disk Busy    TA     Ts     Tr     Energy Improvement
FS2-static       23%       24%    53%   -1.6%          31%
FS2-dynamic      17%       50%    72%    31%           55%

Table 2: For the TPC-W benchmark, this table shows the percentage improvement in performance (TA, Ts, Tr) and energy consumption for each disk access, with respect to the Ext2 file system, where the disk was busy 31% of the time.

As shown in Figure 10, with static replication, the average response time perceived by TPC-W clients is improved by 20% over Ext2. The average disk access time is improved by 24%, which is due entirely to a shorter seek time (53%). Rotational delay remains mostly unchanged (-1.6%) for reasons to be given shortly. Due to the shorter average seek distance, the energy consumed per disk access is reduced by 31%.

2 We could have chosen any region, but the middle region of the disk was chosen simply because most of the relevant disk blocks were already placed there. Choosing this region, therefore, results in the fewest number of replications.


Figure 11: For the TPC-W benchmark, parts (a–c) show the disk access time for the 1st, the 2nd, and the 7th FS2-dynamic run.

4.3.3 FS2-Dynamic

We call the technique described in the previous section FS2-static because it statically modifies the disk layout. Using FS2-static, seek time was significantly reduced. Unfortunately, as FS2-static is not effective in reducing rotational delay, disk access time is now dominated more by rotational delay (as shown in Figure 10(b)). This ineffectiveness in reducing rotational delay is attributed to its static aggregation of all the disk blocks of the database and web server files. It brings the relevant disk blocks much closer to each other on disk, but as long as they are not placed on the same cylinder or track, rotational delay will remain unchanged.

Figure 12: Plots (a) and (b) show zoomed-in sections of Figures 9(b1) and (c1), respectively.

As discussed in Section 2.2, replicating data and placing them adjacent to one another, in the order in which they were accessed, not only reduces seek time but can also reduce rotational delay. We call this FS2-dynamic, as we dynamically monitor disk accesses and replicate candidate blocks on-the-fly. It is implemented as described in Sections 2 and 3. Starting with no replication, multiple runs were needed for the system to determine which blocks to replicate before any significant benefit was observed. The replication overhead was observed to be minimal: even in the worst case (the first run), the average response time perceived by the TPC-W clients was degraded by only 1.7%, and in the other runs the replication overhead was almost negligible. The 7th run of FS2-dynamic is shown in Figures 9(c1) and (c2). To illustrate the progression between consecutive runs, we show the disk access time distributions of the 1st, the 2nd, and the 7th run in Figure 11. During each run, because the middle region of the disk is accessed most, disk blocks that were accessed in other regions get replicated there. Consequently, the resulting disk trace looks very similar to that of FS2-static (shown in Figures 9(b1) and (b2)). There are, however, subtle differences between the two, which enable FS2-dynamic to significantly outperform its static counterpart.

Figures 12(a) and (b) provide a zoomed-in view of the first 150 seconds of the FS2-static disk trace and the first 100 seconds of the FS2-dynamic disk trace, respectively. In these two intervals, the same amount of data was accessed from the disk, a clear indication that the dynamic technique is more efficient than its static counterpart in replicating data and placing it onto the disk. As mentioned earlier, the static technique indiscriminately replicates all statically-related disk blocks together without regard to how the blocks are temporally related. As a result, the distance between temporally-related blocks becomes smaller, but often not small enough for them to be located on the same cylinder or track. On the other hand, as the dynamic technique places replicas onto the disk in the same order that the blocks were previously accessed, it substantially increases the chance for temporally-related data blocks to be on the same cylinder, or even on the same track. From Figure 12, we can clearly see that under FS2-dynamic, the range of accessed disk sectors is much narrower than under FS2-static.

As a result, FS2-dynamic achieves a 34% improvement in the average response time perceived by the TPC-W clients compared to Ext2. The average disk access time is improved by 50%, due to a 72% improvement in seek time and a 31% improvement in rotational delay. FS2-dynamic reduces both seek time and rotational delay more significantly than FS2-static. Furthermore, it reduces the average energy consumed by each disk access by 55%. However, because the TPC-W benchmark always runs for a fixed amount of time (i.e., 20 minutes), faster disk access does not reduce the runtime, and hence the savings in total energy are limited to only 6.7%, as idle energy still dominates. In other workloads, when measured over a constant number of disk accesses, as opposed to constant time, the energy savings are more significant.

We introduced FS2-static and compared it with FS2-dynamic mainly to illustrate the importance of online monitoring in reducing rotational delay. As the dynamic technique was shown to perform much better than the static one, we will henceforth consider only FS2-dynamic and refer to it simply as FS2 unless specified otherwise. Moreover, only the 7th run of each workload is used when we present results for FS2, as we want to give the system enough time to learn frequently-encountered disk access patterns and to modify the disk layout.

Figure 13: Results for the CVS update command: (a) total runtime, (b) average access time, and (c) energy consumed per access.

        Disk Busy    TA     Ts     Tr    Energy Improvement
FS2        73%       41%    38%    53%          40%

Table 3: A summary of results for the CVS update command (TA, Ts, Tr: performance improvement). The disk was busy 82% of the time on Ext2.

4.4 CVS

CVS is a version control system commonly used in software development environments. It is especially useful for multiple developers who work on the same project and share the same code base. CVS allows each developer to work with his own local source tree and later merge the finished work with a central repository.

The CVS update command is one of the most frequently-used commands when working with a CVS-managed source tree. In this benchmark, we ran the command cvs -q update in a local CVS directory containing the Linux 2.6.7 kernel source tree. The results are shown in Figure 13. On an Ext2 file system, the command took 33 seconds to complete. On FS2, it took only 23 seconds, a 31% shorter runtime.

The average disk access time is improved by 41%, as shown in Figure 13(b). Rotational delay is improved by 53%. Seek time is also improved (37%), but its effect is insignificant, as the disk access time in this workload is mostly dominated by rotational delay.

In addition to the performance improvement, FS2 also saves energy in performing disk I/Os. Figure 13(c) plots the average energy consumption per disk access for this workload. As shown in Table 3, FS2 reduces the energy consumption per access by 46%. The total energy dissipated on Ext2 and FS2 during the entire execution of the CVS update command is 289 J and 199 J, respectively, a 31% energy savings.

4.5 Starting X Server and KDE

The startx command starts the X server and the KDE window manager on our testing machine. As shown in Figure 14(a), the startx command finished 4.6 seconds faster on FS2 than on Ext2, yielding a 16% runtime reduction. Even though the average disk access time in this workload is significantly improved (44%), as a result of a 53% shorter average seek time and a 47% shorter rotational delay (shown in Table 4), the runtime is only moderately improved, and much less than what we had initially expected. The reason is that this workload is much less disk I/O-intensive than the CVS workload: for the CVS workload, the disk was busy 82% of the time, whereas for this workload, the disk was busy only 37% of the time.

Figure 14: Results for starting the X server and KDE window manager: (a) total runtime, (b) average access time, and (c) energy consumed per access.

        Disk Busy    TA     Ts     Tr    Energy Improvement
FS2        26%       44%    53%    47%          46%

Table 4: A summary of results for starting the X server and KDE window manager (TA, Ts, Tr: performance improvement). The disk was busy 37% of the time on Ext2.

In addition to improving performance, FS2 also reduces the energy consumed per disk access for the startx command, as shown in Figure 14(c). Starting the X server is calculated to dissipate a total energy of 249 J on Ext2 and 201 J on FS2, a 19% savings in energy.

4.6 SURGE

SURGE is a network workload generator designed to simulate web traffic by imitating a collection of users accessing a web server [3]. To best utilize the available computing resources, many web servers support multiple virtual hosts. To simulate this type of environment, we set up an Apache web server supporting two virtual hosts serving static content on ports 81 and 82 of the testing machine. Each virtual host supports up to 200 simultaneous clients in our experiment.

FS2 reduced the average response time of the Apache web server by 18% compared to Ext2, as illustrated in Figure 15(a). This improvement in average response time was achieved despite the fact that the disk was utilized only 13% of the time. We tried increasing the number of simultaneous clients to 300 per virtual host to observe what would happen if the web server were busier, but the network bandwidth bottleneck on our 100 Mb/sec LAN did not allow us to drive up the workload. With more simultaneous clients, we expect to see larger performance improvements from FS2. Nevertheless, this shows that even for non-disk-bound workloads, performance can still be improved reasonably well.

On average, a disk access takes 68% less time on FS2 than on Ext2. In this particular workload, reductions in seek time and rotational delay contributed almost equally to lowering the disk access time, as shown in Table 5.

The total energy consumed was reduced by merely 1.9% on FS2 because this workload, like the TPC-W workload, always runs for a fixed amount of time. There might be other energy-saving opportunities, however. In particular, FS2 allows the disk to become idle more often: in the TPC-W workload, the disk was idle 70% of the time



[Figure 15 bar charts: Ext2 vs. FS2-Dynamic; (a) average response time in seconds, (b) time per access in ms broken into seek, rotational, and transfer components, (c) energy per access in mJ broken into the same components.]

Figure 15: Results for SURGE: (a) average response time, (b) average access time, and (c) energy consumed per access.

        Disk    --Performance Improvement--    Energy
        Busy     TA      Ts      Tr            Improvement
FS2     4.2%     69%     78%     68%           71%

Table 5: A summary of results for running SURGE. The disk was busy 13% of the time on Ext2.

on Ext2 and 83% on FS2, while in the CVS workload, the disk was idle 18% of the time on Ext2 and 27% on FS2. Combined with traditional disk power-management techniques, which save energy during idle intervals, more energy-saving opportunities can be exploited.

5. RELATED WORK

File System: FFS [12] and its successors [4, 43] improve disk bandwidth and latency by placing related data objects (e.g., data blocks and inodes) near each other on disk. Albeit effective in initial data placement, they do not consider dynamic access patterns. As a result, poorly-placed disk blocks can significantly limit the degree of performance improvement [2, 36]. Based on FFS, Ganger et al. [17] proposed a technique that improves performance for small objects by increasing their adjacency rather than just their locality. Unfortunately, it suffers from the same problem as FFS.

In the Log-structured File System (LFS) [35], disk-write performance can be significantly improved by placing blocks that are written close to one another in time into large contiguous segments on disk. Similarly, FS2 places (replicas of) blocks that are read close to one another in time into large contiguous free disk space. Tuning LFS by putting active segments on higher-bandwidth outer cylinders of the disk can further improve write performance [45]. However, due to segment-cleaning overheads and LFS's inability to improve read performance (and most workloads are dominated by reads), there is no clear indication that it can outperform the file systems in the FFS family.

The closest work related to FS2 is the Hot File Clustering used in the Hierarchical File System (HFS) [16] and the Smart File System [41]. These file systems monitor file access patterns at runtime and move frequently-accessed files to a reserved area on disk. FS2 has several major advantages over these techniques. First, working at block granularity allows us to optimize for all files, not just small ones. It also lets us place meta-data and data blocks adjacent to one another in the order they were previously accessed, which, as we have shown and as indicated in [17], is very important to reducing rotational delay. Second, moving data can break sequentiality; instead, we replicate disk blocks, so both sequential and random disk I/Os can benefit. Third, using a fixed location on disk to rearrange the disk layout has benefit only under light load [25]; using free disk space, we show that significant benefits can be obtained even under heavy load while keeping replication overhead low. Fourth, reserving a fixed amount of disk space for rearranging the disk layout is intrusive, which we avoid by using only the free disk space, as done in this paper.

Adaptive Disk Layout: Early adaptive disk layout techniques were mostly studied either theoretically [46] or via simulation [10, 30, 36, 44]. It has been shown for random disk accesses that the organ pipe heuristic, which places the most frequently-accessed data in the middle of the disk, is optimal [46]. However, as real workloads do not exhibit random disk access patterns, the organ pipe placement is not always optimal and, for various reasons, far from practical. To adapt the organ pipe heuristic to realistic workloads, Vongsathorn and Carson [44] proposed cylinder shuffling: the access frequency of each cylinder is maintained, and based on this frequency, cylinders are reordered according to the organ pipe heuristic at the end of each day. Ruemmler and Wilkes [36] suggested shuffling in units of blocks instead of cylinders, which was shown to be more effective. All of these techniques assume that the disk geometry is known and can be easily simulated, but due to increasingly complex hard disk designs, disk geometry is now mostly hidden. There are tools [1, 14, 37, 42] that can extract the physical disk geometry experimentally, but the extraction usually takes a long time and the extracted information is too cumbersome to use practically at runtime. Our work uses only the disk's logical geometry, and we have shown this to be sufficient in practice by verifying it against the extracted physical geometry.
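The organ pipe arrangement can be sketched in a few lines; this is an illustrative reconstruction, not code from any of the cited systems, and the block IDs and access counts are placeholders:

```python
def organ_pipe_layout(access_counts):
    """Organ pipe heuristic: order blocks so the most frequently
    accessed block sits in the middle of the disk, with progressively
    colder blocks alternating outward toward both edges."""
    hot_first = sorted(access_counts, key=access_counts.get, reverse=True)
    # Ranks 0, 2, 4, ... form the left half (reversed so the hottest
    # block ends up innermost); ranks 1, 3, 5, ... fill the right half.
    left, right = hot_first[0::2], hot_first[1::2]
    return left[::-1] + right

counts = {"a": 50, "b": 40, "c": 30, "d": 20, "e": 10}
print(organ_pipe_layout(counts))  # ['e', 'c', 'a', 'b', 'd']
```

Note that the hottest block ("a") lands at the center position, with access frequency decreasing toward both ends, which is exactly the "pipe organ" profile the heuristic is named for.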

Akyurek and Salem [2] were the first to report block shuffling in a real system. Unfortunately, other than its use of block granularity instead of file granularity, their technique is very similar to HFS and the Smart File System, and therefore suffers from the same pitfalls.

I/O Scheduling: Disk performance can also be improved by means of better I/O scheduling. Seek time can be reduced by using strategies such as SSTF and SCAN [6], C-SCAN [38], and LOOK [29]. Rotational delay can be reduced by taking the approaches in [21, 34, 39, 24]. The anticipatory I/O scheduler [23] avoids hasty I/O scheduling decisions by introducing an additional wait time. These techniques are orthogonal to our work.
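For concreteness, two of the seek-reducing policies mentioned above can be sketched as follows; this is a textbook-style illustration operating on cylinder numbers, not code from the cited schedulers:

```python
def sstf(head, requests):
    """Shortest-Seek-Time-First: always service the pending request
    whose cylinder is closest to the current head position."""
    pending, order = list(requests), []
    while pending:
        nxt = min(pending, key=lambda cyl: abs(cyl - head))
        pending.remove(nxt)
        order.append(nxt)
        head = nxt
    return order

def look(head, requests):
    """LOOK: sweep upward servicing requests in cylinder order, then
    reverse direction and service the remaining ones on the way back."""
    up = sorted(r for r in requests if r >= head)
    down = sorted((r for r in requests if r < head), reverse=True)
    return up + down

print(sstf(50, [95, 10, 60, 40]))  # [60, 40, 10, 95]
print(look(50, [95, 10, 60, 40]))  # [60, 95, 40, 10]
```

SSTF greedily minimizes each individual seek but can starve far-away requests; LOOK bounds that starvation by sweeping, which is why elevator-style variants are the common default.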

Energy Efficiency: Due mainly to higher rotational speeds and heavier platters, disk drives are consuming more energy. Various techniques [7, 8, 19, 26] have been developed to save energy by exploiting disk idleness. However, due to the increasing utilization of disk drives, most idle periods are becoming too small to be exploited for saving energy [20, 47]. Gurumurthi et al. [20] proposed a Dynamic RPM (DRPM) disk design which, in their simulation, was shown to be effective in saving energy even when the disk is fairly busy. However, as DRPM disks are not currently pursued by any disk manufacturer, we cannot benefit from this new architecture. In this paper, we lowered power dissipation by reducing seek distance. As with the DRPM design, power can be reduced even when the disk is busy; unlike DRPM, however, FS2 does not suffer any performance degradation. On the contrary, FS2 actually improves performance while saving energy.

Utilization of Free Disk Space: Free disk space has been exploited by others in the past for different purposes. The Elephant Filesystem [11] uses free disk space to automatically store old versions of files to avoid explicit user-initiated version control. In their system, a cleaner frees older file versions to reclaim disk space, but it is not allowed to free Landmark files. In FS2, all free space used for duplication can be reclaimed, as replicas are only used to enhance performance and invalidating them will not compromise the correctness of the system.

6. DISCUSSION

6.1 Degree of Replication

Our implementation of FS2 allows system administrators to set the maximum degree of replication; in an n-replica system, a disk block can be replicated at most n times. By this definition, traditional file systems are 0-replica systems. There are certain tradeoffs in choosing the parameter n. Obviously, with a large n, replicas can consume more disk space, and there is a cost to invalidate or free replicas. However, a larger degree of replication can significantly improve performance in certain situations. One simple example is the placement of shared libraries: by placing a copy of a shared library close to each application image that uses it, application startup time can be reduced. It would be interesting to find the "optimal" degree of replication for a given disk access pattern. However, in the workloads we studied, due to the lack of data/code sharing, the additional benefit of having multiple replicas per disk block would be minimal, so we simply used a 1-replica system in our evaluation. Lo [18] explored the effect of the degree of replication in more detail through a trace-driven analysis.
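The n-replica bookkeeping described above might look like the following minimal sketch (hypothetical; FS2's actual in-kernel data structures differ, and the block numbers used in the example are made up):

```python
class ReplicaTable:
    """Track replicas in an n-replica system: each logical block may
    have at most `max_degree` replicas (0 = a traditional file system)."""

    def __init__(self, max_degree=1):
        self.max_degree = max_degree
        self.replicas = {}  # logical block -> list of replica locations

    def try_replicate(self, block, free_block):
        """Record a replica of `block` at `free_block`, unless the
        degree limit for this block has already been reached."""
        locs = self.replicas.setdefault(block, [])
        if len(locs) >= self.max_degree:
            return False
        locs.append(free_block)
        return True

    def invalidate(self, block):
        """On a write to `block`, its replicas become stale; drop them
        and return their locations so the space can be reclaimed."""
        return self.replicas.pop(block, [])

table = ReplicaTable(max_degree=1)
print(table.try_replicate(7, 1001))  # True: first replica accepted
print(table.try_replicate(7, 1002))  # False: 1-replica limit reached
print(table.invalidate(7))           # [1001]: space is reclaimable again
```

Raising `max_degree` trades disk space and invalidation cost for the chance to place a copy near every consumer of a hot block, which matters mainly when data or code is shared.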

6.2 Reducing Replication Overhead

Replicating data in free disk space near the current disk head location minimizes interference with foreground tasks. Using freeblock scheduling [27, 28], it is possible to further reduce this overhead by writing replicas to disk only during the rotational delay. However, as indicated by the experimental results, our heuristic is already fairly effective in keeping the replication overhead low.

6.3 Multi-Disk Issues

The most straightforward way to improve disk performance is to spread data across more disks. Redundant Array of Inexpensive Disks (RAID) was first introduced by Patterson et al. [33] as a way to improve aggregate I/O performance by exploiting parallelism between multiple disks, and it has since become widely used.

FS2 operates orthogonally to RAID. The techniques we have used to improve performance on a single disk can be directly applied to each disk in a RAID, so the same performance improvements seen earlier can still be achieved. The problem becomes more interesting, however, when free disk space from multiple disks can be utilized: it can be used not only to improve performance, but also to enhance the fault-tolerance provided by RAID. For example, a traditional RAID-5 system can sustain only one disk failure before data loss. However, if we can use free disk space on one disk to hold recently-modified data from other disks, it may be possible to recover all data even with more than one disk failure: recently-modified data can be recovered from replicas on non-faulty disks, and older data can be recovered from a nightly or weekly backup system.

This introduces another dimension in which free disk space can be used to our advantage: the free space of a disk can hold either replicas of its own disk blocks or replicas from other disks. Both have performance benefits, but holding replicas for other disks can also improve fault-tolerance. These issues are beyond the scope of this paper; we are currently investigating them and building a RAID-based prototype.

7. CONCLUSION

In this paper, we presented the design, implementation, and evaluation of a new file system, called FS2, which contains a runtime component responsible for dynamically reorganizing the disk layout. The use of free disk space provides a flexible way to improve disk I/O performance and reduce power dissipation, while being nonintrusive to users.

To evaluate the effectiveness of FS2, we conducted experiments using a range of workloads. FS2 is shown to improve the client-perceived response time by 34% in the TPC-W benchmark. In SURGE, the average response time is improved by 18%. Even for everyday tasks, FS2 provides significant performance benefits: the time to complete a CVS update command is reduced by 31%, and the time to start an X server and a KDE windows manager is reduced by 16%. As a result of reduced seek distance, energy dissipated due to seeking is also reduced. Overall, FS2 is shown to reduce the total energy consumption of the disk by 1.9–15%.

We are currently working on an extension of this technique that complements the fault-tolerance achieved by RAID. In such a system, free disk space is useful not only for improving performance and saving energy; it may also provide fault-tolerance beyond that of RAID, better safeguarding the data stored on a RAID system when multiple failures occur.

8. ACKNOWLEDGMENTS

We thank the anonymous reviewers and our shepherd, Mike Burrows, for their comments that helped us improve this paper. We also thank Karthick Rajamani at IBM Austin Research Lab for helping us set up the TPC-W workload, and Howard Tsai at The University of Michigan for helping us set up the SURGE workload. This work is partially funded by AFOSR under grants F49620-01-1-0120 and FA9550-05-0287.

9. REFERENCES

[1] M. Aboutabl, A. Agrawala, and J. Decotignie, Temporally Determinate Disk Access: An Experimental Approach, Technical Report CS-TR-3752, 1997.
[2] S. Akyurek and K. Salem, Adaptive Block Rearrangement, Computer Systems, 13(2): 89–121, 1995.
[3] P. Barford and M. Crovella, Generating Representative Web Workloads for Network and Server Performance Evaluation, Proceedings of the ACM Measurement and Modeling of Computer Systems, 151–160, 1998.
[4] R. Card, T. Ts'o, and S. Tweedie, Design and Implementation of the Second Extended Filesystem, First Dutch International Symposium on Linux, 1994.
[5] Transactional Processing Performance Council, http://www.tpc.org/tpcw.
[6] P. J. Denning, Effects of Scheduling on File Memory Operations, AFIPS Spring Joint Computer Conference, 9–21, 1967.
[7] F. Douglis, P. Krishnan, and B. Bershad, Adaptive Disk Spin-down Policies for Mobile Computers, Proceedings of the 2nd USENIX Symposium on Mobile and Location-Independent Computing, 1995.
[8] F. Douglis, P. Krishnan, and B. Marsh, Thwarting the Power-Hungry Disk, USENIX Winter, 292–306, 1994.
[9] J. Douceur and W. Bolosky, A Large-Scale Study of File-System Contents, ACM SIGMETRICS Performance Review, 59–70, 1999.
[10] R. English and A. Stepanov, Loge: A Self-Organizing Disk Controller, Proceedings of the Winter 1992 USENIX Conference, 1992.
[11] D. Santry et al., Deciding When to Forget in the Elephant File System, ACM Symposium on Operating Systems Principles, 110–123, 1999.
[12] M. K. McKusick et al., A Fast File System for UNIX, ACM Transactions on Computer Systems (TOCS), 2(3), 1984.
[13] M. Sivathanu et al., Life or Death at Block Level, Proceedings of the 6th Symposium on Operating Systems Design and Implementation, 379–394, 2004.
[14] Z. Dimitrijevic et al., Diskbench: User-level Disk Feature Extraction Tool, Technical Report, UCSB, 2004.
[15] Linux Filesystem Performance Comparison for OLTP, http://oracle.com/technology/tech/linux/pdf/Linux-FS-Performance-Comparison.pdf.
[16] HFS Plus Volume Format, http://developer.apple.com/technotes/tn/tn1150.html.
[17] G. Ganger and F. Kaashoek, Embedded Inodes and Explicit Groups: Exploiting Disk Bandwidth for Small Files, USENIX Annual Technical Conference, 1–17, 1997.
[18] S.-L. Lo, Ivy: A Study on Replicating Data for Performance Improvement, HP Technical Report HPL-CSP-90-48, 1990.
[19] R. A. Golding, P. Bosch, C. Staelin, T. Sullivan, and J. Wilkes, Idleness is Not Sloth, USENIX Winter, 201–212, 1995.
[20] S. Gurumurthi et al., DRPM: Dynamic Speed Control for Power Management in Server Class Disks, Proceedings of the International Symposium on Computer Architecture (ISCA), 2003.
[21] L. Huang and T. Chiueh, Implementation of a Rotation Latency Sensitive Disk Scheduler, Technical Report ECSL-TR81, 2000.
[22] IBM DB2, http://www-306.ibm.com/software/data/db2.
[23] S. Iyer and P. Druschel, Anticipatory Scheduling: A Disk Scheduling Framework to Overcome Deceptive Idleness in Synchronous I/O, 18th ACM Symposium on Operating Systems Principles (SOSP), 2001.
[24] D. Jacobson and J. Wilkes, Disk Scheduling Algorithms Based on Rotational Position, HP Technical Report HPL-CSP-91-7, 1991.
[25] R. King, Disk Arm Movement in Anticipation of Future Requests, ACM Transactions on Computer Systems, 214–229, 1990.
[26] K. Li, R. Kumpf, P. Horton, and T. E. Anderson, Quantitative Analysis of Disk Drive Power Management in Portable Computers, USENIX Winter, 279–291, 1994.
[27] C. Lumb, J. Schindler, and G. Ganger, Freeblock Scheduling Outside of Disk Firmware, Conference on File and Storage Technologies (FAST), 275–288, 2002.
[28] C. Lumb, J. Schindler, G. Ganger, D. Nagle, and E. Riedel, Towards Higher Disk Head Utilization: Extracting Free Bandwidth from Busy Disk Drives, Proceedings of the Symposium on Operating Systems Design and Implementation, 2000.
[29] A. G. Merten, Some Quantitative Techniques for File Organization, Ph.D. Thesis, 1970.
[30] D. Musser, Block Shuffling in Loge, HP Technical Report CSP-91-18, 1991.
[31] MySQL, http://www.mysql.com.
[32] D. Orr, J. Lepreau, J. Bonn, and R. Mecklenburg, Fast and Flexible Shared Libraries, USENIX Summer, 237–252, 1993.
[33] D. Patterson, G. Gibson, and R. Katz, A Case for Redundant Arrays of Inexpensive Disks (RAID), Proceedings of ACM SIGMOD, 109–116, 1988.
[34] L. Reuther and M. Pohlack, Rotational-Position-Aware Real-Time Disk Scheduling Using a Dynamic Active Subset (DAS), IEEE Real-Time Systems Symposium, 2003.
[35] M. Rosenblum and J. Ousterhout, The Design and Implementation of a Log-Structured File System, ACM Transactions on Computer Systems, 26–52, 1992.
[36] C. Ruemmler and J. Wilkes, Disk Shuffling, HP Technical Report HPL-CSP-91-30, 1991.
[37] J. Schindler and G. Ganger, Automated Disk Drive Characterization, Technical Report CMU-CS-99-176, 1999.
[38] P. H. Seaman, R. A. Lind, and T. L. Wilson, An Analysis of Auxiliary-Storage Activity, IBM Systems Journal, 5(3): 158–170, 1966.
[39] M. Seltzer, P. Chen, and J. Ousterhout, Disk Scheduling Revisited, USENIX Winter, 313–324, 1990.
[40] SQL Server, http://www.microsoft.com/sql/default.mspx.
[41] C. Staelin and H. Garcia-Molina, Smart Filesystems, USENIX Winter, 45–52, 1991.
[42] N. Talagala, R. Arpaci-Dusseau, and D. Patterson, Microbenchmark-based Extraction of Local and Global Disk Characteristics, Technical Report CSD-99-1063, 1999.
[43] S. Tweedie, Journaling the Linux ext2fs Filesystem, LinuxExpo, 1998.
[44] P. Vongsathorn and S. D. Carson, A System for Adaptive Disk Rearrangement, Software Practice and Experience, 20(3): 225–242, 1990.
[45] J. Wang and Y. Hu, PROFS – Performance-Oriented Data Reorganization for Log-structured File System on Multi-Zone Disks, The 9th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 285–293, 2001.
[46] C. K. Wong, Algorithmic Studies in Mass Storage Systems, Computer Science Press, 1983.
[47] Q. Zhu et al., Reducing Energy Consumption of Disk Storage Using Power-Aware Cache Management, The 10th International Symposium on High Performance Computer Architecture (HPCA-10), 2004.
