
Page 1

A Survey of Distributed File System Technology

Jakob Blomer, CERN PH/SFT and Stanford University

ACAT 2014, Prague

1 / 18

Page 2

1 Survey and Taxonomy

2 Highlights of the DFS Toolbox

3 Trends

2 / 18


Page 4

Distributed File Systems


Popular DFSs:

AFS, Ceph, CernVM-FS, dCache, EOS, FhGFS, GlusterFS, GPFS, HDFS, Lustre, MooseFS, XRootD

A distributed file system (DFS) provides

1 persistent storage

2 of opaque data (files)

3 in a hierarchical namespace that is shared among networked nodes

∙ Files survive the lifetime of processes and nodes

∙ POSIX-like interface: open(), close(), read(), write(), . . .

∙ Typically transparent to applications

∙ Data model and interface distinguish a DFS from a distributed (No-)SQL database or a distributed key-value store
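As a minimal illustration of the POSIX-like interface above, the following Python sketch exercises open(), write(), read(), and close() through the os module; the mount point /dfs and the file path are hypothetical, the point being that an application uses a DFS exactly as it would a local file system.

```python
import os

# Hypothetical path on a DFS mount; on a real system this could be an
# AFS, Lustre, or CephFS mount point -- the calls below do not change.
path = "/dfs/home/alice/notes.txt"

# open(), write(), close()
fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
os.write(fd, b"hello, distributed world\n")
os.close(fd)

# open(), read(), close(): the application never sees that the bytes
# may travel over the network to a remote storage server.
fd = os.open(path, os.O_RDONLY)
data = os.read(fd, 4096)
os.close(fd)
print(data.decode())
```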

3 / 18


Page 9

Use Cases and Demands

[Radar chart, axes: mean file size, change frequency, request rate (MB/s), request rate (IOPS), volume, cache hit rate, redundancy, confidentiality, data value]

[Data are illustrative]

Data Classes
∙ Home folders
∙ Physics data (recorded, simulated, analysis results)
∙ Software binaries
∙ Scratch area

Depending on the use case, the dimensions span orders of magnitude (logarithmic axes)

4 / 18


Page 11

Distributed File Systems and Use Cases

> ls .
event_sample.root  analysis.C

File system: “please take special care of this file!”

> ls /
home  data  software  scratch

[Implicit QoS]

∙ It is difficult to perform well under usage characteristics that differ by 4 orders of magnitude

∙ File system performance is highly susceptible to characteristics of individual applications

∙ There is no interface to specify quality of service (QoS) for a particular file

∙ We will deal with a number of DFSs for the foreseeable future

5 / 18

Page 12

Architecture Sketches

Network shares, client-server
[Sketch: clients mount a /share exported by a central server]

Goals: Simplicity, separate storage from application
Example: NFS 3

6 / 18

Page 13

Architecture Sketches

Namespace delegation
[Sketch: the server holding /physics delegates the /physics/ams subtree to another server]

Goals: Scaling network shares, decentralized administration
Example: AFS

6 / 18

Page 14

Architecture Sketches

Object-based file system
[Sketch: clients send create()/delete() to a meta-data server and read()/write() directly to the data servers]

Goals: Separate meta-data from data, incremental scaling
Example: Google File System

6 / 18

Page 15

Architecture Sketches

Parallel file system
[Sketch: like the object-based layout, with file data striped across several data servers]

Goals: Maximum throughput, optimized for large files
Example: Lustre

6 / 18

Page 16

Architecture Sketches

Distributed meta-data
[Sketch: meta-data is itself spread over several servers; clients send create()/delete() to any of them and read()/write() to the data servers]

Goals: Avoid single point of failure and meta-data bottleneck
Example: Ceph

6 / 18

Page 17

Architecture Sketches

Symmetric, peer-to-peer
[Sketch: all nodes are equal and jointly serve /share]

Goals: Conceptual simplicity, inherently scalable
On a global scale, 𝒪(log n) routing precedes file access
Example: GlusterFS

6 / 18

Page 18

Milestones in Distributed File Systems
Biased towards open-source, production file systems

1983 AFS · 1985 NFS · 1995 Zebra · 2000 OceanStore · 2002 Venti · 2003 GFS · 2005 XRootD · 2007 Ceph

7 / 18

Page 19

Milestones in Distributed File Systems
Biased towards open-source, production file systems

1983 · AFS: Andrew File System (client-server)

∙ Roaming home folders

∙ Identity tokens and access control lists (ACLs)

∙ Subtree delegation

∙ Decentralized operation (AFS 3)

“AFS was the first safe and efficient distributed computing system, available [. . . ] on campus. It was a clear precursor to the Dropbox-like software packages today. [. . . ] [It] allowed students (like Drew Houston and Arash Ferdowsi) access to all their stuff from any connected computer.”

http://www.wired.com/2011/12/backdrop-dropbox

7 / 18

Page 20

Milestones in Distributed File Systems
Biased towards open-source, production file systems

1985 · NFS: Network File System (client-server)

∙ Focus on portability

∙ Separation of protocol and implementation

∙ Stateless servers

∙ Fast crash recovery
∙ State re-introduced for locking

Sandberg, Goldberg, Kleiman, Walsh, Lyon (1985)

7 / 18

Page 21

Milestones in Distributed File Systems
Biased towards open-source, production file systems

1995 · Zebra: Zebra File System (parallel)

∙ Striping and parity

∙ Redundant array of inexpensive nodes (RAIN)

∙ Log-structured data

∙ Separation of data and meta-data

[. . . ] sequential transfers. LFS is particularly effective for writing small files, since it can write many files in a single transfer; in contrast, traditional file systems require at least two independent disk transfers for each file. Rosenblum reported a tenfold speedup over traditional file systems for writing small files. LFS is also well-suited for RAIDs because it batches small writes together into large sequential transfers and avoids the expensive parity updates associated with small random writes.

Zebra can be thought of as a log-structured network file system: whereas LFS uses the logging approach at the interface between a file server and its disks, Zebra uses the logging approach at the interface between a client and its servers. Figure 4 illustrates this approach, which we call per-client striping. Each Zebra client organizes its new file data into an append-only log, which it then stripes across the servers. The client computes parity for the log, not for individual files. Each client creates its own log, so a single stripe in the file system contains data written by a single client.

Per-client striping has a number of advantages over per-file striping. The first is that the servers are used efficiently regardless of file sizes: large writes are striped, allowing them to be completed in parallel, and small writes are batched together by the log mechanism and written to the servers in large transfers; no special handling is needed for either case. Second, the parity mechanism is simplified. Each client computes parity for its own log without fear of interactions with other clients. Small files do not have excessive parity overhead because parity is not computed on a per-file basis. Furthermore, parity never needs to be updated because file data are never overwritten in place.

The above introduction to per-client striping leaves some unanswered questions. For example, how can files be shared between client workstations if each client is writing its own log? Zebra solves this problem by introducing a central file manager, separate from the storage servers, that manages metadata such as directories and file attributes and supervises interactions between clients. Also, how is free space reclaimed from the logs? Zebra solves this problem with a stripe cleaner, which is analogous to the cleaner in a log-structured file system. The next section provides a more detailed discussion of these issues and several others.

[Figure 4. Per-client striping in Zebra. Each client forms its new file data into a single append-only log and stripes this log across the servers. In this example file A spans several servers while file B is stored entirely on a single server. Parity is computed for the log, not for individual files.]

3 Zebra Components

The Zebra file system contains four main components as shown in Figure 5: clients, which are the machines that run application programs; storage servers, which store file data; a file manager, which manages the file and directory structure of the file system; and a stripe cleaner, which reclaims unused space on the storage servers. There may be any number of clients and storage servers but only a single file manager and stripe cleaner. More than one of these components may share a single physical machine; for example, it is possible for one machine to be both a storage server and a client. The remainder of this section describes each of the components in isolation; Section 4 then shows how the components work together to implement operations such as reading and writing files, and Sections 5 and 6 describe how Zebra deals with crashes.

We will describe Zebra under the assumption that there are several storage servers, each with a single disk. However, this need not be the case. For example, storage servers could each contain several disks managed as a RAID, thereby giving the appearance to clients of a single disk with higher capacity and throughput. It is also possible to put all of the disks on a single server; clients would treat it as several logical servers, all implemented by the same physical machine. This approach would still provide many of Zebra's benefits: clients would still batch small files for transfer over the network, and it would still be possible to reconstruct data after a disk failure. However, a single-server Zebra system would limit system throughput to that of the one server, and the system would not be able to operate when the server is unavailable.

3.1 Clients

Clients are machines where application programs execute. When an application reads a file the client must [. . . ]

[Figure 5: Zebra schematic. Clients run applications; storage servers store data. The file manager and the stripe cleaner can run on any machine in the system, although it is likely that one machine will run both of them. A storage server may also be a client.]

Hartman, Ousterhout (1995)

7 / 18

Page 22

Milestones in Distributed File Systems
Biased towards open-source, production file systems

2000 · OceanStore: peer-to-peer

∙ “Global scale”: 10¹⁰ users, 10¹⁴ files

∙ Untrusted infrastructure

∙ Based on a peer-to-peer overlay network

∙ Nomadic data through aggressive caching

∙ Foundation for today's decentralized Dropbox replacements

Ideally, a user would entrust all of his or her data to OceanStore; in return, the utility's economies of scale would yield much better availability, performance, and reliability than would be available otherwise. Further, the geographic distribution of servers would support deep archival storage, i.e. storage that would survive major disasters and regional outages. In a time when desktop workstations routinely ship with tens of gigabytes of spinning storage, the management of data is far more expensive than the storage media. OceanStore hopes to take advantage of this excess of storage space to make the management of data seamless and carefree.

1.2 Two Unique Goals

The OceanStore system has two design goals that differentiate it from similar systems: (1) the ability to be constructed from an untrusted infrastructure and (2) support of nomadic data.

Untrusted Infrastructure: OceanStore assumes that the infrastructure is fundamentally untrusted. Servers may crash without warning or leak information to third parties. This lack of trust is inherent in the utility model and is different from other cryptographic systems such as [35]. Only clients can be trusted with cleartext: all information that enters the infrastructure must be encrypted. However, rather than assuming that servers are passive repositories of information (such as in CFS [5]), we allow servers to be able to participate in protocols for distributed consistency management. To this end, we must assume that most of the servers are working correctly most of the time, and that there is one class of servers that we can trust to carry out protocols on our behalf (but not trust with the content of our data). This responsible party is financially responsible for the integrity of our data.

Nomadic Data: In a system as large as OceanStore, locality is of extreme importance. Thus, we have as a goal that data can be cached anywhere, anytime, as illustrated in Figure 1. We call this policy promiscuous caching. Data that is allowed to flow freely is called nomadic data. Note that nomadic data is an extreme consequence of separating information from its physical location. Although promiscuous caching complicates data coherence and location, it provides great flexibility to optimize locality and to trade off consistency for availability. To exploit this flexibility, continuous introspective monitoring is used to discover tacit relationships between objects. The resulting "meta-information" is used for locality management. Promiscuous caching is an important distinction between OceanStore and systems such as NFS [43] and AFS [23] in which cached data is confined to particular servers in particular regions of the network. Experimental systems such as XFS [3] allow "cooperative caching" [12], but only in systems connected by a fast LAN.

The rest of this paper is as follows: Section 2 gives a system-level overview of the OceanStore system. Section 3 shows sample applications of the OceanStore. Section 4 gives more architectural detail, and Section 5 reports on the status of the current prototype. Section 6 examines related work. Concluding remarks are given in Section 7.

2 SYSTEM OVERVIEW

An OceanStore prototype is currently under development. This section provides a brief overview of the planned system. Details on the individual system components are left to Section 4. The fundamental unit in OceanStore is the persistent object. Each object is named by a globally unique identifier, or GUID.

[Figure 1: The OceanStore system. The core of the system is composed of a multitude of highly connected "pools", among which data is allowed to "flow" freely. Clients connect to one or more pools, perhaps intermittently.]

Objects are replicated and stored on multiple servers. This replication provides availability¹ in the presence of network partitions and durability against failure and attack. A given replica is independent of the server on which it resides at any one time; thus we refer to them as floating replicas. A replica for an object is located through one of two mechanisms. First, a fast, probabilistic algorithm attempts to find the object near the requesting machine. If the probabilistic algorithm fails, location is left to a slower, deterministic algorithm.

Objects in the OceanStore are modified through updates. Updates contain information about what changes to make to an object and the assumed state of the object under which those changes were developed, much as in the Bayou system [13]. In principle, every update to an OceanStore object creates a new version². Consistency based on versioning, while more expensive to implement than update-in-place consistency, provides for cleaner recovery in the face of system failures [49]. It also obviates the need for backup and supports "permanent" pointers to information.

OceanStore objects exist in both active and archival forms. An active form of an object is the latest version of its data together with a handle for update. An archival form represents a permanent, read-only version of the object. Archival versions of objects are encoded with an erasure code and spread over hundreds or thousands of servers [18]; since data can be reconstructed from any sufficiently large subset of fragments, the result is that nothing short of a global disaster could ever destroy information. We call this highly redundant data encoding deep archival storage.

An application writer views the OceanStore as a number of sessions. Each session is a sequence of read and write requests related to one another through the session guarantees, in the style of the Bayou system [13]. Session guarantees dictate the level of consistency seen by a session's reads and writes; they can range from supporting extremely loose consistency semantics to supporting the ACID semantics favored in databases. In support of legacy code, OceanStore also provides an array of familiar interfaces such as the Unix file system interface and a simple transactional interface.

¹ If application semantics allow it, this availability is provided at the expense of consistency.
² In fact, groups of updates are combined to create new versions, and we plan to provide interfaces for retiring old versions, as in the Elephant File System [44].

Kubiatowicz et al. (2000)

7 / 18

Page 23

Milestones in Distributed File Systems
Biased towards open-source, production file systems

2002 · Venti: archival storage

∙ De-duplication through content-addressable storage (see the sketch below)

∙ Content hashes provide intrinsic file integrity

∙ Merkle trees verify the file system hierarchy
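To make the first two bullets concrete, here is a toy content-addressable block store in Python. It keys blocks by SHA-256 (Venti itself used SHA-1) and is only a sketch of the idea, not Venti's actual block or index format.

```python
import hashlib

class BlockStore:
    """Toy content-addressable store: each block is keyed by its content hash."""

    def __init__(self):
        self.blocks = {}                      # hex digest -> block data

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Storing identical content twice keeps only one copy (de-duplication).
        self.blocks.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        data = self.blocks[key]
        # The key doubles as a checksum, giving intrinsic integrity.
        assert hashlib.sha256(data).hexdigest() == key
        return data

store = BlockStore()
k1 = store.put(b"event data " * 1000)
k2 = store.put(b"event data " * 1000)         # duplicate write
assert k1 == k2 and len(store.blocks) == 1    # stored exactly once
```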

[. . . ] overlapped and do not benefit from the striping of the index. One possible solution is a form of read-ahead. When reading a block from the data log, it is feasible to also read several following blocks. These extra blocks can be added to the caches without referencing the index. If blocks are read in the same order they were written to the log, the latency of uncached index lookups will be avoided. This strategy should work well for streaming data such as multimedia files.

The basic assumption in Venti is that the growth in capacity of disks combined with the removal of duplicate blocks and compression of their contents enables a model in which it is not necessary to reclaim space by deleting archival data. To demonstrate why we believe this model is practical, we present some statistics derived from a decade's use of the Plan 9 file system.

The computing environment in which we work includes two Plan 9 file servers named bootes and emelie. Bootes was our primary file repository from 1990 until 1997 at which point it was superseded by emelie. Over the life of these two file servers there have been 522 user accounts of which between 50 and 100 were active at any given time. The file servers have hosted numerous development projects and also contain several large data sets including chess end games, astronomical data, satellite imagery, and multimedia files.

Figure 6 depicts the size of the active file system as measured over time by du, the space consumed on the jukebox, and the size of the jukebox's data if it were to be stored on Venti. The ratio of the size of the archival data and the active file system is also given. As can be seen, even without using Venti, the storage required to implement the daily snapshots in Plan 9 is relatively modest, a result of the block level incremental approach to generating a snapshot. When the archival data is stored to Venti the cost of retaining the snapshots is reduced significantly. In the case of the emelie file system, the size on Venti is only slightly larger than the active file system; the cost of retaining the daily snapshots is almost zero. Note that the amount of storage that Venti uses for the snapshots would be the same even if more conventional methods were used to back up the file system. The Plan 9 approach to snapshots is not a necessity, since Venti will remove duplicate blocks.

When stored on Venti, the size of the jukebox data is reduced by three factors: elimination of duplicate blocks, elimination of block fragmentation, and compression of the block contents. Table 2 presents the percent reduction for each of these factors. Note, bootes uses a 6 Kbyte block size while emelie uses 16 Kbyte, so the effect of removing fragmentation is more significant on emelie.

[Figure 6. Graphs of the various sizes of two Plan 9 file servers: storage size (GB) and the ratio of archival to active data for bootes (1990–1998) and emelie (1997–2001), comparing jukebox, Venti, and the active file system.]

Quinlan and Dorward (2002)

7 / 18

Page 24

Milestones in Distributed File Systems
Biased towards open-source, production file systems

2003 · GFS: Google File System (object-based)

∙ Co-designed for map-reduce

∙ Coalesce storage and compute nodes

∙ Serialize data access

Google AppEngine documentation

7 / 18

Page 25

Milestones in Distributed File Systems
Biased towards open-source, production file systems

2005 · XRootD: namespace delegation

∙ Global tree of redirectors

∙ Flexibility through decomposition into pluggable components

∙ Namespace independent from data access

[Sketch: a client's open() is redirected down a tree of cmsd/xrootd server pairs; with a fan-out of 64, one redirector level covers 64¹ = 64 servers and two levels cover 64² = 4096 servers]
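The redirection idea can be sketched as follows; this is a toy model in Python, not the actual xrootd/cmsd protocol, and the server names and file path are made up. A client contacts the top redirector, which forwards the open() towards whichever subtree reports the file, until a data server serves it.

```python
class Node:
    """A redirector (has children) or a data server (holds files)."""
    def __init__(self, name, files=None, children=None):
        self.name = name
        self.files = set(files or [])
        self.children = children or []

    def has(self, path):
        return path in self.files or any(c.has(path) for c in self.children)

def open_file(node, path):
    """Follow redirections down the tree until a data server holds the file."""
    hops = []
    while node.children:
        node = next(c for c in node.children if c.has(path))   # "redirect"
        hops.append(node.name)
    assert path in node.files
    return hops

# With a fan-out of 64, one level covers 64 servers and two levels 64**2 = 4096;
# the toy tree below uses a fan-out of 2 for brevity.
tree = Node("root-redirector", children=[
    Node("eu-redirector", children=[Node("eu-ds1", files={"/store/run1.root"}),
                                    Node("eu-ds2")]),
    Node("us-redirector", children=[Node("us-ds1")]),
])
print(open_file(tree, "/store/run1.root"))    # ['eu-redirector', 'eu-ds1']
```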

Hanushevsky (2013)

7 / 18

Page 26

Milestones in Distributed File Systems
Biased towards open-source, production file systems

2007 · Ceph: Ceph File System and RADOS (decentralized)

∙ Peer-to-peer file system at the cluster scale

∙ Incremental scalability through the CRUSH distributed hash function (see the sketch after Table 5.1)

∙ Adaptive workload distribution

[Figure 5.1: A partial view of a four-level cluster map hierarchy consisting of rows, cabinets, and shelves of disks. Bold lines illustrate items selected by each select operation in the placement rule and fictitious mapping described by Table 5.1.]

Action              Resulting i⃗
take(root)          root
select(1,row)       row2
select(3,cabinet)   cab21 cab23 cab24
select(1,disk)      disk2107 disk2313 disk2437
emit

Table 5.1: A simple rule that distributes three replicas across three cabinets in the same row.
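For illustration, the sketch below mimics the placement rule of Table 5.1 with rendezvous (highest-random-weight) hashing in Python. It is not the real CRUSH algorithm, which uses the bucket-specific function c(r,x) described in the excerpt below, and the cluster map is made up; it only shows how a pure hash function can map an object name to a deterministic, hierarchy-aware set of disks without a central lookup table.

```python
import hashlib

def hrw_choose(key, buckets, n=1):
    """Pick n buckets for key by highest-random-weight (rendezvous) hashing:
    deterministic, and adding or removing a bucket moves only a small share
    of the keys -- the property behind incremental scalability."""
    scored = sorted(buckets,
                    key=lambda b: hashlib.sha256(f"{key}:{b}".encode()).digest(),
                    reverse=True)
    return scored[:n]

# Hypothetical cluster map: row -> cabinet -> disks (all names illustrative).
cluster = {
    "row1": {"cab11": ["disk1101", "disk1102"], "cab12": ["disk1201", "disk1202"],
             "cab13": ["disk1301", "disk1302"]},
    "row2": {"cab21": ["disk2101", "disk2102"], "cab22": ["disk2201", "disk2202"],
             "cab23": ["disk2301", "disk2302"], "cab24": ["disk2401", "disk2402"]},
}

def place(obj):
    # Mirrors Table 5.1: choose 1 row, 3 cabinets in that row, 1 disk per cabinet.
    (row,) = hrw_choose(obj, list(cluster), n=1)
    cabinets = hrw_choose(obj, list(cluster[row]), n=3)
    return [hrw_choose(f"{obj}/{cab}", cluster[row][cab], n=1)[0] for cab in cabinets]

print(place("object-42"))     # three disks, in three different cabinets of one row
```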

[. . . ] descends through any intermediate buckets, pseudo-randomly selecting a nested item in each bucket using the function c(r,x) (defined for each kind of bucket in Section 5.2.4), until it finds an item of the requested type t. The resulting n·|i⃗| distinct items are placed back into the input i⃗ and either form the input for a subsequent select(n,t) or are moved into the result vector with an emit operation.

As an example, the rule defined in Table 5.1 begins at the root of the hierarchy in Figure 5.1 and with the first select(1,row) chooses a single bucket of type “row” (it selects row2).

Weil (2007)

7 / 18

Page 27

A Taxonomy

Distributed File Systems
∙ Security: Merkle trees, access tokens, encryption
∙ Efficiency: caching, striping, log-structured data
∙ Scalability & Usability: horizontal scalability, distributed meta-data, namespace delegation
∙ Fault-tolerance: replication, erasure codes, checkpointing, consensus, leases
∙ POSIX Compliance

Essential: create(), unlink(), stat(), read(), write(), seek(), open(), close()
Difficult for DFSs: flock(), rename(), file ownership, open unlinked files, hard links, device files, IPC files

8 / 18


Page 29

1 Survey and Taxonomy

2 Highlights of the DFS Toolbox

3 Trends

9 / 18

Page 30

Transformative Technologies

Challenges of today’s “petascale” file systems (Petabytes of data, 10⁷ nodes, and 10⁹ files)

File System Integrity

∙ Ensure file integrity over long time and over long distances

∙ Manage the evolving file system tree

∙ Detect subtree changes
∙ Preserve consistent file system snapshots

Fault-Tolerance
∙ Ensure quality of service under regularly failing hardware

Bandwidth Utilization
∙ Efficient writing of small files on spinning disks

∙ Data structures that work throughout the memory hierarchy

10 / 18


Page 33

Data Integrity and File System Snapshots

Root (/): contents A, B; hash(h_A, h_B)
  /A: contents 𝛼, 𝛽; hash(h_𝛼, h_𝛽) =: h_A
    /A/𝛼: contents data_𝛼; hash(data_𝛼) =: h_𝛼
    /A/𝛽: contents data_𝛽; hash(data_𝛽) =: h_𝛽
  /B: contents ∅; hash(∅) =: h_B

Merkle tree

∙ Hash tree with a cryptographic hash function provides secure identifiers for subtrees

∙ It is easy to sign a small hash value (data authenticity)

∙ Efficient calculation of changes (fast replication)

∙ Bonus: versioning and data de-duplication

∙ Full potential together with content-addressable storage

∙ Self-verifying data chunks, trivial to distribute and cache
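A compact Python sketch of the Merkle tree drawn above: directories hash the sorted (name, child hash) pairs and files hash their content, so the root hash identifies a whole snapshot and a change surfaces only along the path to the root. The tree layout and hash conventions are illustrative, not those of a particular file system.

```python
import hashlib

def h(*parts: bytes) -> str:
    return hashlib.sha256(b"".join(parts)).hexdigest()

def merkle(tree):
    """Hash of a tree given as {name: bytes (file) | dict (directory)}."""
    if isinstance(tree, bytes):                       # file: hash its data
        return h(tree)
    entries = sorted((name, merkle(child)) for name, child in tree.items())
    return h(*(f"{name}:{digest}".encode() for name, digest in entries))

# The toy tree from the slide: /A/{alpha, beta} and an empty /B.
root = {"A": {"alpha": b"data_alpha", "beta": b"data_beta"}, "B": {}}

root_before, a_before, b_before = merkle(root), merkle(root["A"]), merkle(root["B"])
root["A"]["alpha"] = b"data_alpha v2"                 # modify one file
root_after, a_after, b_after = merkle(root), merkle(root["A"]), merkle(root["B"])

assert root_before != root_after and a_before != a_after   # changes propagate upwards
assert b_before == b_after                                  # untouched subtree keeps its hash
```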

11 / 18


Page 35

Fault Tolerance and Data Reliability

Fundamental problems as the number of components grows

1 Faults are the norm

2 Faults are often correlated

3 No safe way to distinguish temporary unavailability from faults

Data Reliability Techniques

∙ Replication: simple and fast but large storage overhead

∙ Trend from random placement to “de-correlation”

∙ Erasure codes: any n out of n + k blocks can reconstruct the data [e. g. EOS, QFS]; different trade-offs between computational complexity and storage overhead (see the sketch below)

∙ Checksums: detect silent data corruption
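The simplest erasure code is a single XOR parity block (k = 1, RAID-5 style): any n of the n + 1 blocks reconstruct the data. Production systems such as EOS or QFS use Reed-Solomon-type codes that tolerate k > 1 losses; the Python sketch below only demonstrates the k = 1 case.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]        # n = 3 data blocks
parity = xor_blocks(data)                 # k = 1 parity block: overhead 1/n,
stored = data + [parity]                  # versus 2x-3x for plain replication

lost = 1                                  # pretend the node holding block 1 failed
survivors = [blk for i, blk in enumerate(stored) if i != lost]
assert xor_blocks(survivors) == data[lost]   # XOR of the survivors restores it
```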

Engineering Challenges

∙ Fault detection

∙ Automatic and fast recovery

∙ Failure prediction, e. g. based on MTTF and Markov models

12 / 18

Page 36

Fault Tolerance and Data Reliability

Fundamental problems as the number of components grows

1 Faults are the norm

2 Faults are often correlated

3 No safe way to distinguish temporary unavailability from faults

Example from Google data centers

[. . . ] impact under a variety of replication schemes and placement policies. (Sections 5 and 6)

• Formulate a Markov model for data availability, that can scale to arbitrary cell sizes, and captures the interaction of failures with replication policies and recovery times. (Section 7)

• Introduce multi-cell replication schemes and compare the availability and bandwidth trade-offs against single-cell schemes. (Sections 7 and 8)

• Show the impact of hardware failure on our cells is significantly smaller than the impact of effectively tuning recovery and replication parameters. (Section 8)

Our results show the importance of considering cluster-wide failure events in the choice of replication and recovery policies.

2 Background

We study end to end data availability in a cloud computing storage environment. These environments often use loosely coupled distributed storage systems such as GFS [1, 16] due to the parallel I/O and cost advantages they provide over traditional SAN and NAS solutions. A few relevant characteristics of such systems are:

• Storage server programs running on physical machines in a datacenter, managing local disk storage on behalf of the distributed storage cluster. We refer to the storage server programs as storage nodes or nodes.

• A pool of storage service masters managing data placement, load balancing and recovery, and monitoring of storage nodes.

• A replication or erasure code mechanism for user data to provide resilience to individual component failures.

A large collection of nodes along with their higher level coordination processes [17] are called a cell or storage cell. These systems usually operate in a shared pool of machines running a wide variety of applications. A typical cell may comprise many thousands of nodes housed together in a single building or set of colocated buildings.

2.1 Availability

A storage node becomes unavailable when it fails to respond positively to periodic health checking pings sent by our monitoring system. The node remains unavailable until it regains responsiveness or the storage system reconstructs the data from other surviving nodes.

[Figure 1: Cumulative distribution function of the duration of node unavailability periods.]

Nodes can become unavailable for a large number of reasons. For example, a storage node or networking switch can be overloaded; a node binary or operating system may crash or restart; a machine may experience a hardware error; automated repair processes may temporarily remove disks or machines; or the whole cluster could be brought down for maintenance. The vast majority of such unavailability events are transient and do not result in permanent data loss. Figure 1 plots the CDF of node unavailability duration, showing that less than 10% of events last longer than 15 minutes. This data is gathered from tens of Google storage cells, each with 1000 to 7000 nodes, over a one year period. The cells are located in different datacenters and geographical regions, and have been used continuously by different projects within Google. We use this dataset throughout the paper, unless otherwise specified.

Experience shows that while short unavailability events are most frequent, they tend to have a minor impact on cluster-level availability and data loss. This is because our distributed storage systems typically add enough redundancy to allow data to be served from other sources when a particular node is unavailable. Longer unavailability events, on the other hand, make it more likely that faults will overlap in such a way that data could become unavailable at the cluster level for long periods of time. Therefore, while we track unavailability metrics at multiple time scales in our system, in this paper we focus only on events that are 15 minutes or longer. This interval is long enough to exclude the majority of benign transient events while not too long to exclude significant cluster-wide phenomena. As in [11], we observe that initiating recovery after transient failures is inefficient and reduces resources available for other operations. For these reasons, GFS typically waits 15 minutes before commencing recovery of data on unavailable nodes.

Ford et al. (2010) Link

Statistical separation of temporary and permanent faults

12 / 18

Page 37

Log-Structured Data

Rumble, Kejriwal, Ousterhout (2014)

[. . . ] reduce pause times, but memory consumption increased by an additional 30% and we still experienced occasional pauses of one second or more.

An ideal memory allocator for a DRAM-based storage system such as RAMCloud should have two properties. First, it must be able to copy objects in order to eliminate fragmentation. Second, it must not require a global scan of memory: instead, it must be able to perform the copying incrementally, garbage collecting small regions of memory independently with cost proportional to the size of a region. Among other advantages, the incremental approach allows the garbage collector to focus on regions with the most free space. In the rest of this paper we will show how a log-structured approach to memory management achieves these properties.

In order for incremental garbage collection to work, it must be possible to find the pointers to an object without scanning all of memory. Fortunately, storage systems typically have this property: pointers are confined to index structures where they can be located easily. Traditional storage allocators work in a harsher environment where the allocator has no control over pointers; the log-structured approach could not work in such environments.

3 RAMCloud Overview

Our need for a memory allocator arose in the context of RAMCloud. This section summarizes the features of RAMCloud that relate to its mechanisms for storage management, and motivates why we used log-structured memory instead of a traditional allocator.

RAMCloud is a storage system that stores data in the DRAM of hundreds or thousands of servers within a datacenter, as shown in Figure 2. It takes advantage of low-latency networks to offer remote read times of 5µs and write times of 16µs (for small objects). Each storage server contains two components. A master module manages the main memory of the server to store RAMCloud objects; it handles read and write requests from clients. A backup module uses local disk or flash memory to store backup copies of data owned by masters on other servers. The masters and backups are managed by a central coordinator that handles configuration-related issues such as cluster membership and the distribution of data among the servers. The coordinator is not normally involved in common operations such as reads and writes. All RAMCloud data is present in DRAM at all times; secondary storage is used only to hold duplicate copies for crash recovery.

[Figure 2: RAMCloud cluster architecture. Clients and a coordinator connect to master/backup server pairs over the datacenter network.]

RAMCloud provides a simple key-value data model consisting of uninterpreted data blobs called objects that are named by variable-length keys. Objects are grouped into tables that may span one or more servers in the cluster. Objects must be read or written in their entirety. RAMCloud is optimized for small objects – a few hundred bytes or less – but supports objects up to 1 MB.

Each master's memory contains a collection of objects stored in DRAM and a hash table (see Figure 3). The hash table contains one entry for each object stored on that master; it allows any object to be located quickly, given its table and key. Each live object has exactly one pointer, which is stored in its hash table entry.

[Figure 3: Master servers consist primarily of a hash table and an in-memory log, which is replicated across several backups for durability.]

In order to ensure data durability in the face of server crashes and power failures, each master must keep backup copies of its objects on the secondary storage of other servers. The backup data is organized as a log for maximum efficiency. Each master has its own log, which is divided into 8 MB pieces called segments. Each segment is replicated on several backups (typically two or three). A master uses a different set of backups to replicate each segment, so that its segment replicas end up scattered across the entire cluster.

When a master receives a write request from a client, it adds the new object to its memory, then forwards information about that object to the backups for its current head segment. The backups append the new object to segment replicas stored in nonvolatile buffers; they respond to the master as soon as the object has been copied into their buffer, without issuing an I/O to secondary storage (backups must ensure that data in buffers can survive power failures). Once the master has received replies from all the backups, it responds to the client. Each backup accumulates data in its buffer until the segment is complete. At that point it writes the segment to secondary storage and reallocates the buffer for another segment. This approach has two performance advantages: writes complete without waiting for I/O to secondary storage, and backups use secondary storage bandwidth efficiently by performing I/O in large blocks, even if objects are small.

Idea: Store all modifications in a segmented log (see the sketch below)

Use full hardware bandwidth for small objects

Used by

∙ Zebra experimental DFS

∙ Commercial filers (e. g. NetApp)

∙ Key-value stores (log-structured merge trees)

∙ File systems for flash memory

Advantages

∙ Minimal seeks, no in-place updates

∙ Fast and robust crash recovery

∙ Efficient allocation in DRAM, flash, and disks

∙ Large potential for meta-data
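A toy append-only, segmented log with a hash-table index, loosely in the spirit of the RAMCloud design quoted above; the segment size, record format, and the missing cleaner/garbage collector are deliberate simplifications.

```python
class SegmentedLog:
    """All writes append to the head segment; an index maps each key to the
    location of its latest value. Old values become garbage for a cleaner."""

    def __init__(self, segment_size=2):
        self.segment_size = segment_size
        self.segments = [[]]               # sealed segments plus the head segment
        self.index = {}                    # key -> (segment number, slot)

    def write(self, key, value):
        head = self.segments[-1]
        if len(head) >= self.segment_size: # head full: seal it, start a new one
            head = []
            self.segments.append(head)
        head.append((key, value))          # sequential append, never in place
        self.index[key] = (len(self.segments) - 1, len(head) - 1)

    def read(self, key):
        seg, slot = self.index[key]
        return self.segments[seg][slot][1]

log = SegmentedLog(segment_size=2)
log.write("/scratch/a", b"v1")
log.write("/scratch/a", b"v2")             # an overwrite is just a new record
log.write("/scratch/b", b"x")
assert log.read("/scratch/a") == b"v2"
```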

13 / 18


Page 39

1 Survey and Taxonomy

2 Highlights of the DFS Toolbox

3 Trends

14 / 18

Page 40

Do we get what we need?

Typical physics experiment cluster

∙ Up to 1 000 standard nodes

∙ 1 GbE or 10 GbE network, close to full bisection bandwidth

Goals for a DFS for analysis applications

∙ At least 90 % available disk capacity

∙ At least 50 % of maximum aggregated throughput

∙ Symmetric architecture, fully decentralized

∙ Fault-tolerant to a small number of disk/node failures

Bruce Allen, MPI for Gravitational Physics (2009) Link

15 / 18

Page 41

Complexity and Decomposition in Distributed File Systems

Complexity

∙ Open source DFSs comprise some 100 kLOC to 750 kLOC

∙ It takes a community effort to stabilize them

∙ Once established, it can become prohibitively expensive to move to a different DFS (data lock-in)

∙ At least 5 years from inception to widespread adoption

Decomposition

∙ Good track record of “outsourcing” tasks, e. g. authentication (Kerberos), distributed coordination (ZooKeeper)

∙ Ongoing: separation of namespace and data access

⊕ Allows for incremental improvements

⊖ We introduce more friction by exposing more interfaces

A challenge: maintain a sufficiently high level of POSIX compliance

16 / 18



Page 45: A Survey of Distributed File System Technology

Combining Storage and Compute Nodes

[Chart: Increase (×1 to ×1000) vs. Year (1994, 2004, 2014) for HDD, DRAM, and Ethernet]

∙ Bandwidth ≪ Capacity

∙ “Exascale” computing (EB of data, 10^18 ops/s) envisaged by 2020

∙ Storage capacity and bandwidth scale at different paces

∙ It will be more difficult or impossible to constantly “move data in and out” (see the drain-time sketch below)

∙ Ethernet bandwidth scaled similarly to capacity

∙ Yielding segregation of storage and computing to a symmetric DFS can be part of a solution
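To see why the capacity/bandwidth gap bites, consider the time it takes just to read one disk end to end. The sketch below uses the HDD figures from the backup table at the end of the deck; the conclusion (full drains went from minutes to hours) is the point, not the exact numbers.

```python
# Time to read one full disk end to end, using the HDD figures from the
# backup table at the end of the deck (1994 vs. 2014 Seagate drives).

def drain_time_s(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    return capacity_bytes / bandwidth_bytes_per_s

hdd_1994_s = drain_time_s(4.3e9, 9e6)     # ~480 s, about 8 minutes
hdd_2014_s = drain_time_s(6e12, 220e6)    # ~27,000 s, about 7.6 hours

print(f"1994: {hdd_1994_s / 60:.0f} min to drain one disk")
print(f"2014: {hdd_2014_s / 3600:.1f} h to drain one disk "
      f"(~{hdd_2014_s / hdd_1994_s:.0f}x longer)")
```

A single full scan of a 2014 drive takes roughly 60 times longer than in 1994, which is one way to read the “move data in and out” bullet above.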

17 / 18

Page 46: A Survey of Distributed File System Technology

Combining Storage and Compute Nodes

Example from HPC

Checkpointing is the state-of-the-art technique for delivering fault tolerance for high-performance computing (HPC) on large-scale systems. It has been shown that writing the state of a process to persistent storage is the largest contributor to the performance overhead of checkpointing [12]. Let's assume that parallel file systems are expected to continue to improve linearly in throughput with the growing number of nodes (an optimistic assumption); however, note that per node memory (the amount of state that is to be saved) will likely continue to grow exponentially. Assume the MTTF is modeled after the BlueGene with a claimed 1000 years MTTF per node (another optimistic assumption). Figure 5 shows the expected MTTF of about 7 days for a 0.8 petaflop IBM BlueGene/P with 64K nodes, while the checkpointing overhead is about 34 minutes; at 1M nodes (~1 exaflop), the MTTF would be only 8.4 hours, while the checkpointing overhead would be over 9 hours.
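The MTTF figures in the quoted passage follow from the usual assumption of independent node failures, under which the system MTTF is roughly the per-node MTTF divided by the node count. The sketch below redoes that arithmetic; the 1000-year per-node MTTF is the assumption quoted above, and the exact values differ slightly from the quoted ones because the quoted numbers come from a more detailed model.

```python
# Rough reconstruction of the MTTF scaling in the quoted passage.
# Assumes independent node failures, so system MTTF ~ per-node MTTF / #nodes.

HOURS_PER_YEAR = 8766
node_mttf_h = 1000 * HOURS_PER_YEAR    # 1000-year per-node MTTF (quoted assumption)

for nodes in (1024, 65536, 1_000_000):
    system_mttf_h = node_mttf_h / nodes
    print(f"{nodes:>9} nodes: system MTTF ~ {system_mttf_h:7.1f} h "
          f"(~{system_mttf_h / 24:.1f} days)")

# 65,536 nodes    -> ~5.6 days, the same order as the ~7 days quoted for BlueGene/P
# 1,000,000 nodes -> ~8.8 hours, close to the ~8.4 hours quoted above
```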

Figure 5: Expected checkpointing cost and MTTF towards exascale

Simulation results (see Figure 6) from 1K nodes to 2M nodes (simulating HEC from 2000 to 2019) show the application uptime collapse for capability HEC systems. Today, 20% or more of the computing capacity in a large HPC system is wasted due to failures and recoveries [12], which is confirmed by the simulation results.

Figure 6: Simulation application uptime towards exascale

The distributed file system shown in the simulation results is a hypothetical file system that could scale near linearly by preserving the data locality in checkpointing. The application requirements are modeled to include 7 days of computing on the entire HEC system. By 1M nodes, checkpointing with parallel file systems (red line with circle symbol) has a complete performance collapse (as was predicted from Figure 5). Note that the hypothetical distributed file system (blue line with hollow circle symbol) could still have 90%+ uptime even at exascale levels.
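The collapse in the quoted simulation can also be motivated with a standard first-order model, Young's approximation for the optimal checkpoint interval. This is not the model used for Figure 6, but it shows the same qualitative cliff. The checkpoint write times are the ~34 minutes and ~9 hours quoted above; everything else is the simplified model's assumption.

```python
# First-order checkpointing model (Young's approximation). Illustrative only;
# this is not the simulation model behind Figure 6.
import math

HOURS_PER_YEAR = 8766
node_mttf_h = 1000 * HOURS_PER_YEAR     # per-node MTTF assumed in the quoted text

def waste_fraction(delta_h: float, mttf_h: float) -> float:
    """Fraction of time lost to checkpoints and rework at the optimal interval.

    Young's approximation: optimal checkpoint interval tau = sqrt(2*delta*M),
    wasted fraction ~ delta/tau + tau/(2*M). Breaks down once delta approaches M.
    """
    tau = math.sqrt(2 * delta_h * mttf_h)
    return min(delta_h / tau + tau / (2 * mttf_h), 1.0)

# Checkpoint write times taken from the quoted text (~34 min and ~9 h)
for nodes, delta_h in [(65536, 34 / 60), (1_000_000, 9.0)]:
    mttf_h = node_mttf_h / nodes
    print(f"{nodes:>9} nodes: MTTF {mttf_h:6.1f} h, checkpoint {delta_h:4.1f} h, "
          f"time lost ~ {waste_fraction(delta_h, mttf_h):.0%}")
```

Around 64K nodes this simplified model loses on the order of 10 % of the time to checkpointing; at 1M nodes the checkpoint time exceeds the system MTTF, so essentially no forward progress is possible without a different storage architecture.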

3. RADICAL NEW VISION

We believe the HEC community needs to develop both the theoretical and practical aspects of building efficient and scalable distributed storage for HEC systems that will scale four orders of magnitude in concurrency. We believe that emerging distributed file systems could co-exist with existing parallel file systems, and could be optimized to support a subset of workloads critical for HPC and MTC workloads at exascale. There have been other distributed file systems proposed in the past, such as Google's GFS [13] and Yahoo's HDFS [14]; however, these have not been widely adopted in HEC due to the workloads, data access patterns, and supported interfaces (POSIX [15]) being quite different. Current file systems lack scalable distributed metadata management, and well-defined interfaces to expose data locality for general computing frameworks that are not constrained by the map-reduce model, to allow data-aware job scheduling with batch schedulers (e.g. PBS [16], SGE [17], Condor [18], Falkon [19]). We believe that future HEC systems should be designed with non-volatile memory (e.g. solid state memory [20], phase change memory [21]) on every compute node (Figure 7); every compute node would actively participate in the metadata and data management, leveraging the abundance of computational power many-core processors will have and the many orders of magnitude higher bisection bandwidth in multi-dimensional torus networks as compared to available cost-effective bandwidth into remote network persistent storage. This shift in large-scale storage architecture design will lead to improving application performance and scalability for some of the most demanding applications. This shift in design is controversial as it requires storage architectures in HEC to be redefined from their traditional architectures of the past several decades. This new approach was not feasible until recently due to the unreliable spinning disk technology [22] that has dominated the persistent storage space since the dawn of supercomputing. However, the advancements in solid-state memory (with MTTF of over two million hours [23]) have opened up opportunities to rethink storage systems for HEC, distributing storage on every compute node, without sacrificing node reliability or power consumption. The benefits of this new architecture lie in enabling some workloads to scale near-linearly with system scale by leveraging data locality and the full network bisection bandwidth.

Figure 7: Proposed storage architecture with persistent local storage on every compute node

[Figure 5 axes: System MTTF (hours) and Checkpointing Overhead (hours) vs. System Scale (# of nodes).
Figure 6 axes: Application Uptime % vs. Scale (# of nodes); series: No Checkpointing, Checkpointing to Parallel File System, Checkpointing to Distributed File System; marked points: 2000 BG/L 1,024 nodes, 2007 BG/L 106,496 nodes, 2009 BG/P 73,728 nodes, 2019 ~1,000,000 nodes]

Raicu, Foster, Beckman (2011) Link

∙ “Exascale” computing (EB of data, 10^18 ops/s) envisaged by 2020

∙ Storage capacity and bandwidth scale at different paces

∙ It will be more difficult or impossible to constantly “move data in and out”

∙ Ethernet bandwidth scaled similarly to capacity

∙ Yielding segregation of storage and computing to a symmetric DFS can be part of a solution

17 / 18

Page 47: A Survey of Distributed File System Technology

The Road Ahead

Distributed File Systems

∙ Seamless integration with applications

∙ Feature-rich: quotas, permissions, links

∙ Hierarchical organization allows for globally federated storage

Distributed Key-Value and Blob Stores

∙ Very good scalability on local networks

∙ Flat namespace + search facility

∙ Turns out that certain types of data don’t need a hierarchical namespace, e. g. cached objects, media, VM images (sketched below)

Building Blocks
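One reason cached objects, media files, and VM images fit a flat namespace is that they can be addressed by content rather than by path. A minimal sketch of that idea follows; a plain Python dictionary stands in for a distributed key-value/blob store, so this is an illustration, not any particular system's API.

```python
# Content-addressed storage in a flat namespace: the key is derived from the
# data itself, so no directory hierarchy is needed and identical objects dedupe.
import hashlib

store: dict[str, bytes] = {}   # stand-in for a distributed key-value/blob store

def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    store[key] = data          # idempotent: same content -> same key
    return key

def get(key: str) -> bytes:
    return store[key]

vm_image = b"...VM image bytes..."
key = put(vm_image)
assert get(key) == vm_image
assert put(vm_image) == key    # re-uploading identical content is a no-op
```

A hierarchical DFS, by contrast, has to keep the path-to-object mapping consistent across nodes, which is exactly the metadata burden discussed earlier.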

18 / 18

Page 48: A Survey of Distributed File System Technology

http://lonelychairsatcern.tumblr.com

Page 49: A Survey of Distributed File System Technology

Backup Slides

Page 50: A Survey of Distributed File System Technology

Source of Hardware Bandwidth and Capacity Numbers

Method and entries marked † from Patterson (2004) Link; the increase factors are re-derived in the sketch below the table sources.

          Hard Disk Drives        DRAM                          Ethernet
Year      Capacity   Bandwidth    Capacity        Bandwidth     Bandwidth
1993                              16 Mibit/chip†  267 MiB/s†
1994      4.3 GB†    9 MB/s†
1995                                                            100 Mbit/s†
2003      73.4 GB†   86 MB/s†                                   10 Gbit/s†
2004                              512 Mibit/chip  3.2 GiB/s
2014      6 TB       220 MB/s‡    8 Gibit/chip    25.6 GiB/s    100 Gbit/s
Increase  ×1395      ×24          ×512            ×98           ×1000

‡http://www.storagereview.com/seagate_enterprise_capacity_6tb_35_sas_hdd_review_v4

HDD: Seagate ST15150 (1994)†, Seagate 373453 (2004)†, Seagate ST6000NM0034 (2014)

DRAM: Fast Page DRAM (1993)†, DDR2-400 (2004), DDR4-3200 (2014)

Ethernet: Fast Ethernet IEEE 802.3u (1995)†, 10 GbitE IEEE 802.3ae (2003)†, 100 GbitE IEEE 802.3bj (2014)

High-end SSD example: Hitachi FlashMAX III (2014), ≈2 TB, ≈2 GB/s
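As a sanity check, the "Increase" row can be re-derived from the table entries themselves; the sketch below is pure unit conversion, no new data.

```python
# Re-deriving the "Increase" row of the table above from its own entries.

hdd_capacity   = 6e12 / 4.3e9          # 6 TB vs 4.3 GB           -> ~1395x
hdd_bandwidth  = 220e6 / 9e6           # 220 MB/s vs 9 MB/s       -> ~24x
dram_capacity  = (8 * 1024) / 16       # 8 Gibit vs 16 Mibit      -> 512x
dram_bandwidth = (25.6 * 1024) / 267   # 25.6 GiB/s vs 267 MiB/s  -> ~98x
ethernet       = 100e9 / 100e6         # 100 Gbit/s vs 100 Mbit/s -> 1000x

for name, factor in [("HDD capacity", hdd_capacity), ("HDD bandwidth", hdd_bandwidth),
                     ("DRAM capacity", dram_capacity), ("DRAM bandwidth", dram_bandwidth),
                     ("Ethernet bandwidth", ethernet)]:
    print(f"{name:>18}: x{factor:.0f}")
```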

21 / 18

Page 51: A Survey of Distributed File System Technology

Bibliography I

Survey Articles

Satyanarayanan, M. (1990). A survey of distributed file systems. Annual Review of Computer Science, 4(1):73–104.

Guan, P., Kuhl, M., Li, Z., and Liu, X. (2000). A survey of distributed file systems. University of California, San Diego.

Agarwal, P. and Li, H. C. (2003). A survey of secure, fault-tolerant distributed file systems. URL: http://www.cs.utexas.edu/users/browne/cs395f2003/projects/LiAgarwalReport.pdf.

Thanh, T. D., Mohan, S., Choi, E., Kim, S., and Kim, P. (2008). A taxonomy and survey on distributed file systems. In Proc. int. conf. on Networked Computing and Advanced Information Management (NCM’08), pages 144–149.

Depardon, B., Séguin, C., and Mahec, G. L. (2013). Analysis of six distributed file systems. Technical Report hal-00789086, Université de Picardie Jules Verne.

22 / 18

Page 52: A Survey of Distributed File System Technology

Bibliography II

File Systems

Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., and Lyon, B. (1985). Design and implementation of the Sun network filesystem. In Proc. of the Summer USENIX conference, pages 119–130.

Morris, J. H., Satyanarayanan, M., Conner, M. H., Howard, J. H., Rosenthal, D. S. H., and Smith, F. D. (1986). Andrew: A distributed personal computing environment. Communications of the ACM, 29(3):184–201.

Hartman, J. H. and Ousterhout, J. K. (1995). The Zebra striped network file system. ACM Transactions on Computer Systems, 13(3):274–310.

Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. (2000). OceanStore: An architecture for global-scale persistent storage. ACM SIGPLAN Notices, 35(11):190–201.

23 / 18

Page 53: A Survey of Distributed File System Technology

Bibliography III

Quinlan, S. and Dorward, S. (2002). Venti: a new approach to archival storage. In Proc. of the 1st USENIX Conf. on File and Storage Technologies (FAST’02), pages 89–102.

Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. ACM SIGOPS Operating Systems Review, 37(5):29–43.

Schwan, P. (2003). Lustre: Building a file system for 1,000-node clusters. In Proc. of the 2003 Linux Symposium, pages 380–386.

Dorigo, A., Elmer, P., Furano, F., and Hanushevsky, A. (2005). XROOTD - a highly scalable architecture for data access. WSEAS Transactions on Computers, 4(4):348–353.

Weil, S. A. (2007). Ceph: reliable, scalable, and high-performance distributed storage. PhD thesis, University of California Santa Cruz.

24 / 18