CIS 620 Advanced Operating Systems Lecture 11 – Distributed File Systems, Consistency and Replication Prof. Timothy Arndt BU 331


Page 1: CIS 620 Advanced Operating Systems Lecture 11 – Distributed File Systems, Consistency and Replication Prof. Timothy Arndt BU 331

CIS 620 Advanced Operating Systems
Lecture 11 – Distributed File Systems, Consistency and Replication
Prof. Timothy Arndt
BU 331

Page 2:

Distributed File Systems

• File service vs. file server. The file service is the specification; a file server is a process running on a machine to implement the file service for (some) files on that machine.
• A normal distributed system would have one file service but perhaps many file servers.
• If we have very different kinds of file systems, we might not be able to have a single file service, since some functions may not be available everywhere.

Page 3:

Distributed File Systems

• File server design: what is a file?
• Sequence of bytes: Unix, MS-DOS, Windows.
• Sequence of records, possibly with keys: mainframes. We do not cover these file systems; they are often discussed in database courses.

Page 4:

Distributed File Systems

• File attributes: rwx, and perhaps a (append).
• These are really a subset of what is called an ACL (access control list) or a capability.
• You get ACLs and capabilities by reading columns and rows, respectively, of the access matrix.
• Other attributes: owner, group, various dates, size, dump, autocompress, immutable.

Page 5:

Distributed File Systems

• Upload/download vs. remote access. With upload/download, the only file services supplied are read file and write file.
• All modifications are done on a local copy of the file.
• Conceptually simple at first glance.
• Whole-file transfers are efficient (assuming you are going to access most of the file) when compared to multiple small accesses.
• Not an efficient use of bandwidth if you access only a small part of a large file.
• Requires storage on the client.

Page 6:

Distributed File Systems

• What about concurrent updates? What if one client reads, "forgets" to write for a long time, and then writes back its "new" version, overwriting newer changes from others?
• Remote access means direct individual reads and writes to the remote copy; the file stays on the server.
• Issue of (client) buffering: good to reduce the number of remote accesses, but what about semantics when a write occurs?

Page 7:

Distributed File Systems

Note that metadata is written even for a read (the access time), so if you want faithful semantics, every client read must modify metadata on the server, or all requests for metadata (e.g. ls or dir commands) must go to the server.
• Cache consistency question.
• Directories: a mapping from names to files/directories. Contains rules for names of files and (sub)directories. Hierarchy, i.e. a tree.
• (hard) links

Page 8:

Distributed File Systems

• With hard links the filesystem becomes a Directed Acyclic Graph instead of a simple tree.

• Symbolic links. Symbolic, not symmetric; indeed asymmetric. Consider:

cd ~

mkdir dir1

touch dir1/file1

ln -s dir1/file1 file2

Page 9:

Distributed File Systems

file2 has a new inode; it is a new type of file called a symlink, and its "contents" are the name of the file dir1/file1.
When accessed, file2 returns the contents of file1, but it is not equal to file1.
• If file1 is deleted, file2 "exists" but is invalid.
• If a new file1 is created, file2 now points to it.
Symbolic links can point to directories as well. With symbolic links pointing to directories, the file system becomes a general graph, i.e. directed cycles are permitted.
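The symlink behavior above can be reproduced from a script; a minimal sketch (using absolute paths in a temporary directory rather than the cd-based example above):

```python
import os
import tempfile

# Set up: a real file and a symlink whose "contents" are the file's name.
d = tempfile.mkdtemp()
dir1 = os.path.join(d, "dir1")
file1 = os.path.join(dir1, "file1")
file2 = os.path.join(d, "file2")

os.mkdir(dir1)
with open(file1, "w") as f:
    f.write("original")
os.symlink(file1, file2)   # file2 gets its own inode, of type symlink

print(os.readlink(file2))  # the symlink's "contents": the name of file1
print(open(file2).read())  # accessing file2 returns file1's contents

os.remove(file1)           # file2 still "exists" but is now dangling
print(os.path.lexists(file2), os.path.exists(file2))   # True False

with open(file1, "w") as f:   # recreate file1: file2 points to it again
    f.write("replacement")
print(open(file2).read())     # replacement
```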

Page 10:

Distributed File Systems

Imagine hard links pointing to directories (Unix does not permit this).

cd ~
mkdir B; mkdir C
mkdir B/D; mkdir B/E
ln B B/D/oh-my

Now you have a loop with honest-looking links. Normally you can't remove a directory (i.e. unlink it from its parent) unless it is empty.
• But when you can have multiple hard links to a directory, you should permit removing (i.e. unlinking) one even if the directory is not empty.

Page 11:

Distributed File Systems

So in the above example you could unlink B from its parent.
Now you have garbage (unreachable, i.e. unnamable) directories B, D, and E.
For a centralized system you need conventional garbage collection.
For a distributed system you need a distributed garbage collector, which is much harder.
• Transparency: Location transparency
• The path name (i.e. the full name of the file) does not say where the file is located.

Page 12:

Distributed File Systems

Location independence
• The path name is independent of the server. Hence you can move a file from server to server without changing its name.
• Have a namespace of files and then have some (dynamically) assigned to certain servers. This namespace would be the same on all machines in the system.
Root transparency (a made-up name)
• / is the same on all systems.
• This would ruin some conventions like /tmp.

Page 13:

Distributed File Systems

• Examples: machine + path naming
• /machine/path
• machine:path
Mounting a remote file system onto the local hierarchy: when done intelligently, we get location transparency.
A single namespace looking the same on all machines.

Page 14:

Distributed File Systems

• Two-level naming. We said above that a directory is a mapping from names to files (and subdirectories). More formally, the directory maps the user name /home/me/class-notes.html to the OS name for that file, 143428 (the Unix inode number).
These two names are sometimes called the symbolic and binary names.
For some systems the binary names are available.
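The two-level mapping is visible from user code: stat reports the binary name (inode number) behind a symbolic name, and a hard link is just a second symbolic name for the same binary name. A minimal sketch (file names are illustrative):

```python
import os
import tempfile

d = tempfile.mkdtemp()
name1 = os.path.join(d, "class-notes.html")   # a symbolic name
with open(name1, "w") as f:
    f.write("notes")

ino = os.stat(name1).st_ino   # the binary name: the inode number
print(ino)

# A hard link is a second symbolic name mapping to the same binary name.
name2 = os.path.join(d, "notes-link")
os.link(name1, name2)
print(os.stat(name2).st_ino == ino)   # True
```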

Page 15:

Distributed File Systems

The binary name could contain the server name, so that one could directly reference files on other filesystems/machines.
• Unix doesn't do this.
We could have symbolic names contain the server name.
• Unix doesn't do this either.
• VMS did something like this: the symbolic name was something like nodename::filename.
We could have the name lookup yield multiple binary names.

Page 16:

Distributed File Systems

• Redundant storage of files for availability.
• Naturally must worry about updates: when are they visible? What about concurrent updates?
• Whenever you hear of a system that keeps multiple copies of something, an immediate question should be "are these immutable?". If the answer is no, the next question is "what are the update semantics?"
• Sharing semantics: Unix semantics. A read returns the value stored by the last write.

Page 17:

Distributed File Systems

• Actually Unix doesn't quite do this. If a write is large (several blocks), the disk does a seek for each. During a seek, the process sleeps (in the kernel). Another process can be writing a range of blocks that intersects the blocks of the first write. The result could be (depending on disk scheduling) that there is no single last write.
• Perhaps Unix semantics means: a read returns the value stored by the last write, provided one exists.
• Perhaps Unix semantics means: a write syscall should be thought of as a sequence of write-block syscalls, and similarly for reads. A read-block syscall returns the value of the last write-block syscall for that block.

Page 18:

Distributed File Systems

It is easy to get these same semantics for systems with file servers providing
• no client-side copies (upload/download), and
• no client-side caching.
Session semantics
• Changes to an open file are visible only to the process (machine?) that issued the open. When the file is closed, the changes become visible to all.
• If you are using client caching, you cannot flush dirty blocks until close. (What if you run out of buffer space?)

Page 19:

Distributed File Systems

This may mess up file-pointer semantics.
• The file pointer is shared across fork, so all the children of a parent share it.
• But if the children run on another machine with session semantics, the file pointer can't be shared, since the other machine does not see the effect of the writes done by the parent.
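The shared-offset mechanics can be demonstrated even within a single process: descriptors created by dup share one file offset, just as descriptors inherited across fork do. A minimal sketch:

```python
import os
import tempfile

fd1, path = tempfile.mkstemp()
os.write(fd1, b"abcdefgh")
os.lseek(fd1, 0, os.SEEK_SET)

# dup'ed descriptors (like descriptors inherited across fork) share one
# open-file description, and hence one file offset.
fd2 = os.dup(fd1)

r1 = os.read(fd1, 4)   # b'abcd'
r2 = os.read(fd2, 4)   # b'efgh': fd1's read advanced fd2's offset too
print(r1, r2)

os.close(fd1)
os.close(fd2)
os.remove(path)
```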

• Immutable files: then there is "no problem". Fine if you don't want to change anything.

Page 20:

Distributed File Systems

Can have "version numbers".
• The old version may become inaccessible (at least under the current name).
• With version numbers, if you use the name without a number you get the highest-numbered version.
• But really you do have the old (full) name accessible. VMS definitely did this.
• Note that directories are still mutable; otherwise no create-file is possible.

Page 21:

Distributed File Systems

• Distributed file system implementation: file usage characteristics, measured under Unix at a university.
• Not obvious that the same results would hold in a different environment.
Findings
• 1. Most files are small (< 10K).
• 2. Reading dominates writing.
• 3. Sequential accesses dominate.
• 4. Most files have a short lifetime.

Page 22:

Distributed File Systems

• 5. Sharing is unusual.
• 6. Most processes use few files.
• 7. File classes with different properties exist.
Some conclusions
• 1 suggests whole-file transfer may be worthwhile (except for really big files).
• 2+5 suggest client caching and dealing with multiple writers somehow, even if the latter is slow (since it is infrequent).
• 4 suggests doing creates on the client.

Page 23:

Distributed File Systems

• Not so clear. Possibly the short-lifetime files are temporaries created in /tmp or /usr/tmp or /somethingorother/tmp. These would not be on the server anyway.
• 7 suggests having multiple mechanisms for the several classes.
• Implementation choices: are servers and clients homogeneous?
• Common Unix+NFS: any machine can be a server and/or a client.

Page 24:

Distributed File Systems

• User-mode implementation: servers for files and directories are user programs, so one can configure some machines to offer the services and others not to.
Fundamentally different: either the hardware or the software is fundamentally different for clients and servers.
In Unix some server code is in the kernel, but other code is a user program (run as root) called nfsd.
Should file and directory servers be together?

Page 25:

Distributed File Systems

• If yes, less communication.
• If no, more modular, "cleaner".
Looking up a/b/c when a, a/b, and a/b/c are on different servers:
• The natural solution is for server-a to return the name of server-a/b.
• Then the client contacts server-a/b, gets the name of server-a/b/c, etc.
• Alternatively, server-a forwards the request to server-a/b, which forwards it to server-a/b/c.
• The natural method takes 6 communications (3 RPCs).
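A toy count of the messages in the two lookup styles (server names are illustrative):

```python
def iterative_lookup(servers):
    """Client does one full RPC per server: request out, reply back."""
    messages = 0
    for _ in servers:
        messages += 2   # client -> server, then server -> client
    return messages

def forwarded_lookup(servers):
    """Request hops server to server; only the last one replies to the client."""
    return len(servers) + 1

path_servers = ["server-a", "server-a/b", "server-a/b/c"]
print(iterative_lookup(path_servers))   # 6 communications (3 RPCs)
print(forwarded_lookup(path_servers))   # 4 communications, but not RPC
```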

Page 26:

Distributed File Systems

• The alternative takes 4 communications, but it is not RPC.
Name caching
• The translation from a/b/c to the inode (i.e. symbolic to binary name) is expensive even for centralized systems.
• It is called namei in Unix and was once measured to be a significant percentage of all kernel activity.
• Later Unix added "namei caching".
• Potentially an even greater time saver for distributed systems, since communication is expensive.
• Must worry about obsolete entries.

Page 27:

Distributed File Systems

• Stateless vs. stateful: should the server keep information between requests from a user, i.e. should the server maintain state?
What state?
• Recall that open returns an integer called a file descriptor that is subsequently used in read/write.
• With a stateless server, each read/write must be self-contained, i.e. it cannot refer to the file descriptor.
• Why?
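A sketch of what "self-contained" means: each request carries everything the server needs, so the server keeps no table between calls. The ReadRequest shape is illustrative, with a path standing in for a self-describing file handle:

```python
import os
import tempfile
from dataclasses import dataclass

@dataclass(frozen=True)
class ReadRequest:
    path: str      # stand-in for a self-describing file handle
    offset: int    # the request carries the position; no server-side seek pointer
    count: int

class StatelessServer:
    def read(self, req):
        # Nothing survives between calls: open, seek, read, close every time,
        # so there is no descriptor table to lose in a crash.
        with open(req.path, "rb") as f:
            f.seek(req.offset)
            return f.read(req.count)

# usage
fd, path = tempfile.mkstemp()
os.write(fd, b"hello world")
os.close(fd)

server = StatelessServer()
print(server.read(ReadRequest(path, 6, 5)))   # b'world'
```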

Page 28:

Distributed File Systems

Advantages of stateless
• Fault tolerant: no state to be lost in a crash.
• No open/close needed (saves messages).
• No space used for tables (state requires storage).
• No limit on the number of open files (no tables to fill up).
• No problem if a client crashes (no state to be confused by).
Advantages of stateful
• Shorter read/write requests (a descriptor is shorter than a name).

Page 29:

Distributed File Systems

• Better performance.
• Since we keep track of which files are open, we know to keep those inodes in memory.
But a stateless server could keep a memory cache of inodes as well (evicting via LRU instead of at close; not as good).
• Blocks can be read in advance (read ahead). Of course a stateless server can read ahead too.
• The difference is that with a stateful server we can better decide when accesses are sequential.
• Idempotency is easier (keep sequence numbers).
• File locking is possible (the lock is state).
A stateless server can write a lock file by convention, or can call a lock server.
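The lock-file convention can be sketched with an atomic exclusive create. (Historically O_EXCL was not reliable over some NFS versions, one reason lock servers exist; the path below is illustrative.)

```python
import os
import tempfile

def try_lock(lockpath):
    """Atomically create the lock file; exactly one client can win."""
    try:
        fd = os.open(lockpath, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

def unlock(lockpath):
    os.remove(lockpath)

# usage
lock = os.path.join(tempfile.mkdtemp(), "file1.lock")
print(try_lock(lock))   # True: we hold the lock
print(try_lock(lock))   # False: a second locker is refused
unlock(lock)
print(try_lock(lock))   # True again after unlock
```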

Page 30:

Caching

• There are four places to store a file supplied by a file server (these are not mutually exclusive):
Server's disk
• Always done.
Server's main memory
• Normally done.
• Standard buffer cache.
• Clear performance gain.
• Little if any semantic problem.

Page 31:

Caching

Client's main memory
• Considerable performance gain.
• Considerable semantic considerations.
• The one we will study.
Client's disk
• Not so common now, with cheaper memory.
Unit of caching
• File vs. block.
• Tradeoff of fewer accesses vs. storage efficiency.

Page 32:

Caching

What eviction algorithm?
• Exact LRU is feasible because we can afford the time to do it (via linked lists), since the access rate is low.
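An exact-LRU block cache is easy to sketch with an ordered map (Python's OrderedDict plays the role of the linked list):

```python
from collections import OrderedDict

class BlockCache:
    """Exact LRU over (file, block-number) keys."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # oldest entry first

    def get(self, key):
        if key not in self.blocks:
            return None               # miss: caller fetches from the server
        self.blocks.move_to_end(key)  # mark most recently used
        return self.blocks[key]

    def put(self, key, data):
        self.blocks[key] = data
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used

# usage
cache = BlockCache(2)
cache.put(("f", 0), b"a")
cache.put(("f", 1), b"b")
cache.get(("f", 0))          # touch block 0
cache.put(("f", 2), b"c")    # evicts block 1, the LRU entry
print(cache.get(("f", 1)))   # None
print(cache.get(("f", 0)))   # b'a'
```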

• Where in the client's memory should the cache go?
The user's process
• The cache will die with the process.
• No cache reuse among distinct processes.
• Not done for a normal OS.
• A big deal in databases: cache management is a well-studied DB problem.

Page 33:

Caching

The kernel (i.e. the client's kernel)
• A system call is required even for a cache hit.
• Quite common.
Another process
• "Cleaner" than in the kernel.
• Easier to debug.
• Slower.
• Might get paged out by the kernel!
• Cache consistency: the big question.

Page 34:

Caching

Write-through
• All writes are sent to the server (as well as to the client cache).
• Hence it does not lower traffic for writes.
• It does not by itself fix values in other caches; we need to invalidate or update other caches.
Alternatively, the client cache can check with the server whenever supplying a block, to ensure that the block is not obsolete.
Hence we still need to reach the server for all accesses, but at least the reads that hit in the cache only need to send a tiny message (a timestamp, not data).
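Write-through with validate-on-read can be sketched as follows; the "server" is a plain dict standing in for remote storage, and a real client would compare timestamps rather than whole blocks:

```python
class WriteThroughCache:
    def __init__(self, server):
        self.server = server    # shared dict: stand-in for the file server
        self.cache = {}

    def write(self, block, data):
        self.cache[block] = data
        self.server[block] = data    # every write also goes to the server

    def read(self, block):
        if block in self.cache:
            # Validate with the server before trusting a cached block; in a
            # real system this is a tiny timestamp check, not a data fetch.
            if self.server.get(block) != self.cache[block]:
                self.cache[block] = self.server[block]
            return self.cache[block]
        self.cache[block] = self.server[block]
        return self.cache[block]

# usage: two clients sharing one server
server = {}
c1, c2 = WriteThroughCache(server), WriteThroughCache(server)
c1.write("b0", b"v1")
print(c2.read("b0"))   # b'v1'
c1.write("b0", b"v2")  # c2's cached copy is now obsolete...
print(c2.read("b0"))   # b'v2'  ...but validation catches it
```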

Page 35:

Caching

Delayed write
• Wait a while (30 seconds is used in some NFS implementations) and then send a bulk write message.
• This is more efficient than a bunch of small write messages.
• If the file is deleted quickly, you might never write it.
• Semantics are now time dependent (and ugly).

Page 36:

Caching

• Write on close: session semantics.
• Fewer messages, since there are more writes than closes.
• Not beautiful (think of two files simultaneously opened).
• Not much worse than normal (uniprocessor) semantics. The difference is that it appears to be much more likely to hit the bad case.
Delayed write on close
• Combines the advantages and disadvantages of delayed write and write on close.

Page 37:

Caching

Doing it "right"
• Multiprocessor caching (of central memory) is well studied, and many solutions are known.
• Use cache-consistency (a.k.a. cache-coherence) methods, which are well known.
• Centralized solutions are possible, but none are cheap.
• Perhaps NFS is good enough, and there is not enough reason to change (NFS predates the cache-coherence work).

Page 38:

Replication

• Some issues are similar to (client) caching. Why? Because whenever you have multiple copies of anything, ask:
• Are they immutable?
• What is the update policy?
• How do you keep copies consistent?
• Purposes of replication: Reliability
• A "backup" is available if data is corrupted on one server.

Page 39:

Replication

Availability
• You only need to reach any one of the servers to access the file (at least for queries).
• Not the same as reliability.
Performance
• Each server handles less than the full load (for a query-only system, much less).
• Can use the closest server, lowering network delays.
• Not important for a distributed system on one physical network.
• Very important for web mirror sites.

Page 40:

Replication

• Transparency: if we can't tell that files are replicated, we say the system has replication transparency.
Creation can be completely opaque:
• i.e. fully manual: users use copy commands.
• If the directory supports multiple binary names for a single symbolic name, use this when making copies; presumably subsequent opens will try the binary names in order (so they are not opaque).

Page 41:

Replication

Creation can use lazy replication.
• The user creates the original; the system later makes copies; subsequent opens can be (re)directed at any copy.
Creation can use group communication.
• The user directs requests at a group.
• Hence creation happens to all copies in the group at once.

Page 42:

Replication

• Update protocols: Primary copy
• All updates are done to the primary copy.
• This server writes the update to stable storage and then updates all the other (secondary) copies.
• After a crash, the server looks at stable storage and sees if there are any updates to complete.
• Reads are done from any copy.
• This is good for reads (read any one copy); writes are not so good: you can't write if the primary copy is unavailable.

Page 43:

Replication

• Semantics: the update can take a long time (some of the secondaries can be down). While the update is in progress, reads are concurrent with it; that is, readers might get the old or the new value, depending on which copy they read.
Voting
• All copies are equal (symmetric).
• To write, you must write at least WQ of the copies (a write quorum). Set the version number of all these copies to 1 + the max of the current version numbers.
• To read, you must read at least RQ copies and use the value with the highest version.
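The voting scheme above can be sketched as a small simulation; quorum membership is chosen by hand here, where a real system would pick any WQ or RQ live replicas:

```python
class Replica:
    def __init__(self):
        self.version, self.value = 0, None

def write(replicas, wq, value):
    """Write to a quorum: first read its versions, then bump past the max."""
    new_version = 1 + max(replicas[i].version for i in wq)
    for i in wq:
        replicas[i].version, replicas[i].value = new_version, value

def read(replicas, rq):
    """Read a quorum and return the value carrying the highest version."""
    best = max(rq, key=lambda i: replicas[i].version)
    return replicas[best].value

# N=5 with WQ=3 and RQ=3: WQ+RQ=6 > 5, so any two quorums intersect.
replicas = [Replica() for _ in range(5)]
write(replicas, [0, 1, 2], "new contents")
print(read(replicas, [2, 3, 4]))   # replica 2 is in both quorums
```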

Page 44:

Replication

• Require WQ+RQ > number of copies.
Hence any write quorum and any read quorum intersect.
Hence the highest version number in any read quorum is the highest version number there is.
Hence you always read the current version.
Consider the extremes (WQ=1 with RQ=n, or RQ=1 with WQ=n).
Fine points
• To write, you must first read all the copies in your WQ to get the version number.
• Must prevent races.
• Let N=2, WQ=2, RQ=1. Both copies (A and B) have version number 10.

Page 45:

Replication

• Two updates start: U1 wants to write 1234, U2 wants to write 6789.
• Both read the version numbers and add 1 (getting 11).
• U1 writes A and U2 writes B at roughly the same time.
• Later, U1 writes B and U2 writes A.
• Now both copies are at version 11, but A=6789 and B=1234.
Voting with ghosts
• Often reads dominate writes, so we choose RQ=1 (or at least RQ very small, so WQ very large).

Page 46:

Replication

• This makes it hard to write. E.g. with RQ=1 we need WQ=n, and hence we can't update if any machine is down.
• When one detects that a server is down, a ghost is created.
• A ghost cannot participate in a read quorum, but can in a write quorum; a write quorum must contain at least one non-ghost.
• A ghost throws away any value written to it.
• A ghost always has version 0.
• When the crashed server reboots, it accesses a read quorum to update its value.

Page 47:

Structured Peer-to-Peer Systems

• Balancing load in a peer-to-peer system by replication.

Page 48:

Handling Byzantine Failures

• The different phases in Byzantine fault tolerance.

Page 49:

High Availability in Peer-to-Peer Systems

• The ratio r_rep/r_ec as a function of node availability a.

Page 50:

NFS

• NFS: Sun Microsystems's Network File System. An "industry standard", the dominant system.
• Machines can be (and often are) both clients and servers.
• The basic idea is that servers export directories and clients mount them.
• When a server exports a directory, the subtree rooted there is exported.
• In Unix, exporting is specified in /etc/exports.

Page 51:

NFS

• In Unix, mounting is specified in /etc/fstab (fstab = file system table). In Unix w/o NFS, what you mount are filesystems.
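Illustrative entries for the two files (hostnames, paths, and option names here are made up, and exact syntax varies across NFS implementations):

```
# /etc/exports on the server: export /home read-write to one client,
# read-only to another
/home   clientA(rw,sync)   clientB(ro)

# /etc/fstab on the client: mount the server's /home at /mnt/home over NFS
server:/home   /mnt/home   nfs   defaults   0 0
```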

• Two protocols: Mounting
• The client sends the server a message containing the pathname (on the server) of the directory it wishes to mount.
• The server returns a handle for the directory.
Subsequent read/write calls use the handle.
The handle has data giving the disk, the inode #, et al.
The handle is not an index into a table of actively exported directories. Why not?

Page 52:

NFS

Because the table would be state, and NFS is stateless. This mounting can be done at any time; it is often done at client boot time.
• Automounting
File and directory access
• Most Unix system calls are supported.
• Open/close are not supported: NFS is stateless.
• We do have lookup, which returns a file handle. But this handle is not an index into a table; instead it contains the data needed.
• As indicated previously, the stateless nature of NFS makes Unix locking semantics hard to achieve.

Page 53:

NFS

Authentication
• The client gives the rwx bits to the server.
• How does the server know the client is the machine it claims to be?
• Various cryptographic keys.
• This and other information is stored in NIS (Network Information Service), a.k.a. yellow pages.
• NIS is replicated: update the master copy; the master updates the slaves; there is a window of inconsistency.

Page 54:

NFS

Implementation
• The client's system-call layer processes I/O system calls and calls the virtual file system (VFS) layer.
• VFS has a v-node (virtual i-node) for each open file.
• For local files, the v-node points to an i-node in the kernel.
• For remote files, the v-node points to an r-node (remote i-node) in the NFS client code.
Blow by blow
• Mount (remote directory, local directory)
• First the mount program goes to work: it contacts the server and obtains a handle for the remote directory.

Page 55:

NFS

• It makes the mount system call, passing the handle.
• Now the kernel takes over: it makes a v-node for the remote directory, asks the client code to construct an r-node, and has the v-node point to the r-node.
• Open system call
• While parsing the name of the file, the kernel (VFS layer) hits the local directory on which the remote one is mounted (this part is similar to ordinary mounts of local filesystems).
• The kernel gets the v-node of the remote directory (just as it would get an i-node if processing local files).

Page 56:

NFS

• The kernel asks the client code to open the file (given the r-node).
• The client code calls the server code to look up the remaining portion of the filename.
• The server does this and returns a handle (but does not keep a record of this). Presumably the server, via the VFS and local OS, does an open, and this data is part of the handle. So the handle gives enough information for the server code to determine the v-node on the server machine.

Page 57:

NFS

• When the client gets a handle for the remote file, it makes an r-node for it. This is returned to the VFS layer, which makes a v-node for the newly opened remote file. This v-node points to the r-node; the latter contains the handle information.
• The kernel returns a file descriptor, which points to the v-node.
Read/write
• VFS finds the v-node from the file descriptor it is given.
• It realizes the file is remote and asks the client code to do the read/write on the given r-node (pointed to by the v-node).

Page 58:

NFS

• The client code gets the handle from its r-node table and contacts the server code.
• The server verifies that the handle is valid (perhaps using authentication) and determines the v-node.
• VFS (on the server) is called with the v-node, and the read/write is performed by the local (on-server) OS.
• Read ahead is implemented but, as stated before, it is primitive (always read ahead).
Caching
• Servers cache, but this is not a big deal.
• Clients cache.

Page 59:

NFS

• There are potential problems, of course, so:
• Discard cached entries after some number of seconds.
• On open, the server is contacted to see when the file was last modified. If it is newer than the cached version, the cached version is discarded.
• After some number of seconds, all dirty cache blocks are flushed back to the server.
• All these Band-Aids still do not give proper semantics (or even Unix semantics).

Page 60:

NFS

Lessons learned (from AFS, but applicable in some generality)
• Workstations, i.e. clients, have cycles to burn, so do as much as possible on the client.
• Cache whenever possible.
• Exploit usage properties: several classes of files (e.g. temporary). Trades off simplicity for efficiency.
• Minimize system-wide knowledge and change: helps scalability; favors hierarchies.

Page 61:

NFS

• Trust the fewest possible entities. Try not to depend on the "kindness of strangers".
• Batch work where possible.