cloud computing file storage systems


Post on 23-Feb-2016






Cloud Computing File Storage Systems. Sindhuja Venkatesh, University at Buffalo. CSE 726 Hot Topics in Cloud Computing. 21 Oct 2011.




Overview

Google File System (GFS)
IBM General Parallel File System (GPFS)
Comparisons


Google File System: Introduction

Component failures are the norm.
Files are huge by traditional standards.
Files are modified mostly by appending.
Applications and the file system API are co-designed.

[3]
Design Overview

The system is built from inexpensive components that fail often.
The system stores a modest number of large files.
Two kinds of reads: large streaming reads and small random reads.
Large sequential writes.
Efficient support for concurrent appends.
High sustained bandwidth matters more than low latency.


[3] [5]
Architecture (contd.)

Client translates the file name and byte offset to a chunk index.
Client sends the request to the master.
Master replies with the chunk handle and the locations of the replicas.
Client caches this information.
Client sends a request to a close replica, specifying the chunk handle and byte range.
Requests to the master are typically batched (a client asks for multiple chunks in one request).
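The read path above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Master` and `Client` classes, the lookup table, and the use of the first listed replica as the "close" one are hypothetical and not part of GFS itself.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size used by GFS


class Master:
    """Hypothetical in-memory master: (file name, chunk index) -> (handle, replicas)."""

    def __init__(self, table):
        self.table = table
        self.requests = 0  # counts how often clients had to contact the master

    def lookup(self, filename, chunk_index):
        self.requests += 1
        return self.table[(filename, chunk_index)]


class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # cached chunk handles and replica locations

    def locate(self, filename, byte_offset):
        # Step 1: translate the byte offset into a chunk index.
        chunk_index = byte_offset // CHUNK_SIZE
        key = (filename, chunk_index)
        # Steps 2-4: ask the master only on a cache miss, then cache the reply.
        if key not in self.cache:
            self.cache[key] = self.master.lookup(filename, chunk_index)
        handle, replicas = self.cache[key]
        # Step 5: pick a replica (assumption: the first one listed is the closest)
        # and compute the byte offset inside the chunk.
        return handle, replicas[0], byte_offset % CHUNK_SIZE


master = Master({
    ("/logs/a", 0): ("h0", ["cs1", "cs2"]),
    ("/logs/a", 1): ("h1", ["cs2", "cs3"]),
})
client = Client(master)
first = client.locate("/logs/a", 70 * 1024 * 1024)
second = client.locate("/logs/a", 100 * 1024 * 1024)
```

Both offsets fall into chunk 1, so the second call is served from the client cache and the master is contacted only once.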

Chunk Size

The chunk size is chosen to be 64 MB.
Advantages of a large chunk size:
  Less interaction between clients and the master
  Reduced network overhead
  Smaller metadata stored at the master
Disadvantages:
  A small file occupies a single chunk, which can become a hot spot.
  A higher replication factor is one solution.
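A little arithmetic shows why a large chunk size reduces master interaction and metadata. The sketch below compares the 64 MB chunk size with a hypothetical 4 MB chunk size chosen only for contrast; the file sizes are also illustrative.

```python
MB = 1024 * 1024
GB = 1024 * MB


def chunks_needed(file_size, chunk_size):
    """Number of chunks (and so master metadata entries) a file occupies."""
    return (file_size + chunk_size - 1) // chunk_size  # ceiling division


large_file_64mb = chunks_needed(10 * GB, 64 * MB)  # 64 MB chunks
large_file_4mb = chunks_needed(10 * GB, 4 * MB)    # hypothetical smaller chunks
small_file = chunks_needed(1 * MB, 64 * MB)        # a small file fits in one chunk
```

A 10 GB file needs 160 chunk entries at 64 MB but 2560 at 4 MB. The small file occupies exactly one chunk, which is why many clients reading it all hit the same few chunkservers.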

Metadata

Three major types:
  File and chunk namespaces
  Mapping from files to chunks
  Locations of chunk replicas
All metadata is stored in memory.
In-memory data structures
Chunk locations
Operation log

Consistency Model - Read

Consider a set of data modifications and a set of reads, all executed by different clients. Furthermore, assume that the reads are executed a sufficient time after the writes.
Consistent: all clients see the same thing.
Defined: all clients see the modification in its entirety (atomic).
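The two definitions can be made concrete with a small classifier. This is a sketch, not GFS code: `region_state` and its arguments are hypothetical, each replica's view of a file region is modeled as a byte string, and "consistent but undefined" is the label used here for a region where all replicas agree but the data is a mix of several writers' fragments.

```python
def region_state(replica_views, written):
    """Classify a file region after a mutation.

    replica_views: what a client would read from each replica (list of bytes)
    written: the data one mutation intended to write
    """
    if len(set(replica_views)) > 1:
        return "inconsistent"              # clients may see different data
    if replica_views[0] == written:
        return "defined"                   # same everywhere, mutation seen in full
    return "consistent but undefined"      # same everywhere, but mingled fragments


serial_write = region_state([b"AAAA", b"AAAA"], b"AAAA")
concurrent_writes = region_state([b"AABB", b"AABB"], b"AAAA")
failed_write = region_state([b"AAAA", b"AABB"], b"AAAA")
```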

Lease and Mutation Order - Write

Client asks master for all replicas.

Master replies. Client caches.

Client pre-pushes data to all replicas.

After all replicas acknowledge, client sends write request to primary.

Primary forwards write request to all replicas.

Secondary(s) signal completion.

Primary replies to client. Errors are handled by retrying.

Atomic Record Appends

Similar to the lease-based mutation procedure described above.
Client pushes data to all replicas.
Client sends the request to the primary.
The primary:
  Pads the current chunk if necessary, telling the client to retry.
  Otherwise writes the data and tells the replicas to do the same.
Failures may cause a record to be duplicated; duplicates are handled by the client.
Data may be different at each replica.
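The pad-and-retry behavior of record append can be sketched as follows. This is a minimal single-process illustration: the `Primary` class, the tiny 64-byte chunk size, and the zero-byte padding are assumptions made for readability, and forwarding to secondary replicas is left out.

```python
CHUNK_SIZE = 64  # tiny hypothetical chunk size so the example is easy to follow


class Primary:
    """Hypothetical primary replica holding the chunks of one file."""

    def __init__(self):
        self.chunks = [bytearray()]

    def record_append(self, record):
        if len(record) > CHUNK_SIZE:
            raise ValueError("record larger than a chunk")
        current = self.chunks[-1]
        if len(current) + len(record) > CHUNK_SIZE:
            # The record does not fit: pad the current chunk to its full size,
            # start a new chunk, and tell the client to retry.
            current.extend(b"\0" * (CHUNK_SIZE - len(current)))
            self.chunks.append(bytearray())
            return None
        offset = len(current)  # the primary, not the client, chooses the offset
        current.extend(record)
        return len(self.chunks) - 1, offset


def append_with_retry(primary, record):
    """Client side: retry until the primary reports where the record landed."""
    while True:
        result = primary.record_append(record)
        if result is not None:
            return result


primary = Primary()
first = append_with_retry(primary, b"x" * 40)
second = append_with_retry(primary, b"y" * 40)
```

The second record does not fit into the remaining 24 bytes of chunk 0, so that chunk is padded and the record lands at offset 0 of chunk 1.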

Snapshot

A copy of a file or a directory tree at an instant.
Used for checkpointing.
Handled using copy-on-write:
  First, revoke all leases.
  Then duplicate the metadata, but point to the same chunks.
  When a client requests a write, the master allocates a new chunk handle.
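A minimal copy-on-write sketch of the snapshot steps follows. It assumes the master tracks a reference count per chunk handle to decide when a chunk is shared; the class and method names are hypothetical, and lease revocation and the actual copying of chunk data are omitted.

```python
class Master:
    """Hypothetical master metadata with copy-on-write snapshots."""

    def __init__(self):
        self.files = {}      # file name -> list of chunk handles
        self.refcount = {}   # chunk handle -> number of files pointing at it
        self.next_handle = 0

    def _new_handle(self):
        handle = f"c{self.next_handle}"
        self.next_handle += 1
        self.refcount[handle] = 1
        return handle

    def create(self, name, n_chunks):
        self.files[name] = [self._new_handle() for _ in range(n_chunks)]

    def snapshot(self, src, dst):
        # Duplicate only the metadata; both files point to the same chunks.
        self.files[dst] = list(self.files[src])
        for handle in self.files[dst]:
            self.refcount[handle] += 1

    def prepare_write(self, name, index):
        # On the first write to a shared chunk, allocate a new chunk handle.
        handle = self.files[name][index]
        if self.refcount[handle] > 1:
            self.refcount[handle] -= 1
            handle = self._new_handle()
            self.files[name][index] = handle
        return handle


m = Master()
m.create("/a", 2)
m.snapshot("/a", "/a.snap")
written = m.prepare_write("/a", 0)
written_again = m.prepare_write("/a", 0)
```

After the first write, `/a` points to a fresh handle for chunk 0 while the snapshot keeps the original. The second write reuses the new handle because it is no longer shared.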

Master Operation

Namespace management and locking
Replica placement
Creation, re-replication, rebalancing
Garbage collection
Stale replica detection

Fault Tolerance

High availability
  Fast recovery
  Chunk replication
  Master replication
Data integrity

General Parallel File System: Introduction

The file system was fundamentally designed for high-performance computing clusters.
Traditional supercomputing file access involves:
  Parallel access from multiple nodes within a file
  Inter-file parallel access (files in the same directory)
GPFS supports fully parallel access to both file data and metadata.
Even administrative actions are performed in parallel.

[1]
GPFS Architecture

Achieves extreme scalability through a shared-disk architecture.
File system nodes
  Cluster nodes on which the file system and the applications that use it run.
  All nodes have equal access to all disks.
Switching fabric
  Storage area network (SAN)
Shared disks
  Files are striped across all the file system disks.

GPFS Issues

Data striping and allocation, prefetch and write-behind
  Large files are divided into equal-sized blocks, and consecutive blocks are placed on different disks.
  The block size is 256 KB.
  Data is prefetched into the buffer pool.
Large directory support
  Extensible hashing is used for file name lookup in large directories.
Logging and recovery
  All metadata updates are logged.
  Each node has a log for each file system it mounts.
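The striping rule can be illustrated with a short sketch. It assumes a plain round-robin placement of consecutive blocks across disks; the function names and the four-disk configuration are hypothetical, and the actual placement policy in GPFS may differ.

```python
BLOCK_SIZE = 256 * 1024  # 256 KB block size


def locate_block(byte_offset, num_disks):
    """Return (block number, disk index) for a byte offset, round-robin striping."""
    block = byte_offset // BLOCK_SIZE
    return block, block % num_disks


def disks_touched(byte_offset, length, num_disks):
    """Disks involved in reading `length` bytes starting at `byte_offset`."""
    first = byte_offset // BLOCK_SIZE
    last = (byte_offset + length - 1) // BLOCK_SIZE
    return [block % num_disks for block in range(first, last + 1)]


start = locate_block(0, 4)
later = locate_block(5 * BLOCK_SIZE + 10, 4)
one_mb_read = disks_touched(0, 1024 * 1024, 4)
```

A sequential 1 MB read spans four consecutive blocks and therefore four different disks, which is what lets prefetching issue the disk requests in parallel.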

Distributed Locking vs. Centralized Management

Distributed locking: every file system operation acquires an appropriate read or write lock to synchronize with conflicting operations on other nodes before reading or updating any file system data or metadata.
Centralized management: all conflicting operations are forwarded to a designated node, which performs the requested read or update.

Distributed Lock Manager

Uses a centralized lock manager in conjunction with local lock managers in each file system node.
The global lock manager coordinates locks between local lock managers by handing out lock tokens.
Repeated accesses to the same disk object from the same node only require a single message to obtain the right to acquire a lock on the object (the lock token).
Only when an operation on another node requires a conflicting lock on the same object are additional messages necessary to revoke the lock token from the first node so it can be granted to the other node.
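The token idea can be sketched with a toy model. Everything here is an assumption made for illustration: the `TokenManager` and `Node` classes are hypothetical, every token is treated as exclusive, and the message counter simply counts one token request plus one revoke per conflict.

```python
class TokenManager:
    """Hypothetical global token manager; every token is treated as exclusive."""

    def __init__(self):
        self.holder = {}   # object id -> node currently holding its token
        self.messages = 0  # illustrative message count

    def acquire(self, node, obj):
        self.messages += 1  # token request from the node
        previous = self.holder.get(obj)
        if previous is not None and previous is not node:
            previous.tokens.discard(obj)  # revoke from the conflicting holder
            self.messages += 1
        self.holder[obj] = node
        node.tokens.add(obj)


class Node:
    def __init__(self, name, manager):
        self.name = name
        self.manager = manager
        self.tokens = set()  # tokens held by the local lock manager

    def lock(self, obj):
        # With the token already held, the lock is granted locally, no messages.
        if obj not in self.tokens:
            self.manager.acquire(self, obj)
        return obj in self.tokens


manager = TokenManager()
a = Node("a", manager)
b = Node("b", manager)
for _ in range(3):
    a.lock("inode-7")
after_a = manager.messages
b.lock("inode-7")
after_b = manager.messages
```

Three locks from node `a` cost a single message. Only the conflicting request from node `b` adds traffic, for the new request and the revoke.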

Parallel Data Access

Certain classes of supercomputer applications require writing to the same file from multiple nodes.
GPFS uses byte-range locking to synchronize reads and writes to file data.
The first node to access a file is given a byte-range token for the whole file (zero to infinity).
The token range is then narrowed as other nodes access the file concurrently.
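The narrowing of byte-range tokens can be sketched as below. This is a simplified model, not the GPFS negotiation protocol: the `ByteRangeTokens` class is hypothetical, each node holds at most one range, and a holder whose range starts below the requested range is assumed to be active only in the part it keeps.

```python
INF = float("inf")


class ByteRangeTokens:
    """Hypothetical byte-range token table for one file; ranges are [start, end)."""

    def __init__(self):
        self.ranges = {}  # node -> (start, end)

    def request(self, node, start, end):
        granted_end = INF
        for other, (s, e) in list(self.ranges.items()):
            if other == node:
                continue
            if s >= end:
                # The other range lies above the request: stop just before it.
                granted_end = min(granted_end, s)
            elif e > start:
                if s < start:
                    # Overlap: the holder keeps only the part below the request.
                    self.ranges[other] = (s, start)
                else:
                    # The holder starts inside the requested range: revoke fully.
                    del self.ranges[other]
        self.ranges[node] = (start, granted_end)
        return self.ranges[node]


tokens = ByteRangeTokens()
first = tokens.request("n1", 0, 100)        # first writer: whole file
second = tokens.request("n2", 1000, 1100)   # n1 is narrowed to [0, 1000)
third = tokens.request("n3", 500, 600)      # fits between n1 and n2
```

Each node ends up with as large a range as possible around the region it writes, so later writes in the same region need no further token traffic.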

Parallel Data Access

The measurements demonstrate how I/O throughput in GPFS scales when adding more file system nodes and more disks to the system. The figure compares reading and writing a single large file from multiple nodes in parallel against each node reading or writing a different file. At 18 nodes the write throughput leveled off due to a problem in the switch adapter microcode. The other point to note in this figure is that writing to a single file from multiple nodes in GPFS was just as fast as each node writing to a different file, demonstrating the effectiveness of the byte-range token protocol described before.

Synchronizing Access to Metadata

Like other file systems, GPFS uses inodes and indirect blocks to store file attributes and data block addresses.
Write operations in GPFS use a shared write lock on the inode that allows concurrent writers on multiple nodes.
One of the nodes accessing the file is designated as the metanode for the file; only the metanode reads or writes the inode from or to disk.
Each writer updates a locally cached copy of the inode and forwards its inode updates to the metanode periodically or when the shared write token is revoked by a stat() or read() operation on another node.
The metanode for a particular file is elected dynamically with the help of the token server. When a node first accesses a file, it tries to acquire the metanode token for the file. The token is granted to the first node to do so; other nodes instead learn the identity of the metanode.
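Metanode election and the forwarding of inode updates can be sketched as follows. The `TokenServer` class and `merge_inode` function are hypothetical, and the merge rule (keep the largest file size and the latest modification time) is an assumption made for illustration.

```python
class TokenServer:
    """Hypothetical token server that hands out one metanode token per file."""

    def __init__(self):
        self.metanode = {}  # file id -> node holding the metanode token

    def request_metanode(self, file_id, node):
        # The first node to ask becomes the metanode; later nodes only
        # learn the identity of the existing metanode.
        return self.metanode.setdefault(file_id, node)


def merge_inode(on_metanode, forwarded):
    """Assumed merge rule for updates forwarded by a writer to the metanode."""
    return {
        "size": max(on_metanode["size"], forwarded["size"]),
        "mtime": max(on_metanode["mtime"], forwarded["mtime"]),
    }


server = TokenServer()
winner = server.request_metanode("file-1", "nodeA")
learned = server.request_metanode("file-1", "nodeB")
merged = merge_inode({"size": 100, "mtime": 5}, {"size": 80, "mtime": 9})
```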

Allocation Maps

The allocation map records the allocation status (free or in-use) of all disk blocks in the file system. Since each disk block can be divided into up to 32 subblocks to store data for small files, the allocation map contains 32 bits per disk block as well as linked lists for finding a free disk block or a subblock of a particular size efficiently.
For each GPFS file system, one of the nodes in the cluster is responsible for maintaining free space statistics about all allocation regions. This allocation manager node initializes free space statistics by reading the allocation map when the file system is mounted.
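The 32-bits-per-block layout can be illustrated with a small bitmap sketch. The `AllocationMap` class is hypothetical and uses a plain linear scan to find free space, standing in for the linked lists mentioned above; it models a single allocation region only.

```python
SUBBLOCKS = 32  # each disk block can be divided into up to 32 subblocks


class AllocationMap:
    """Hypothetical allocation map: one 32-bit mask per disk block."""

    def __init__(self, num_blocks):
        self.bits = [0] * num_blocks  # bit i set means subblock i is in use

    def allocate(self, n):
        """Claim n contiguous subblocks in one block; return (block, first) or None."""
        if not 1 <= n <= SUBBLOCKS:
            raise ValueError("n must be between 1 and 32")
        mask = (1 << n) - 1
        for block, used in enumerate(self.bits):
            for shift in range(SUBBLOCKS - n + 1):
                if (used & (mask << shift)) == 0:
                    self.bits[block] |= mask << shift
                    return block, shift
        return None

    def free_subblocks(self):
        """Free space statistic, computed by reading the whole map."""
        return sum(SUBBLOCKS - bin(b).count("1") for b in self.bits)


amap = AllocationMap(2)
full_block = amap.allocate(32)   # a whole block for a large file
small_file = amap.allocate(8)    # a fragment for a small file
no_room = amap.allocate(32)      # no completely free block is left
remainder = amap.allocate(24)    # fills the rest of block 1
```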

Token Manager Scaling

The token manager keeps track of all lock tokens granted to all nodes in the cluster. GPFS uses a number of optimizations in the token protocol that significantly reduce the cost of token management and improve response time as well.
When it is necessary to revoke a token, it is the responsibility of the revoking node to send revoke messages to all nodes that are holding the token in a conflicting mode, to collect replies from these nodes, and to forward these as a single message to the token manager.
Acquiring a token will never require more than two messages to the token manager, regardless of how many nodes may be holding the token in a conflicting mode.
The protocol also supports token prefetch and token request batching, which allow acquiring multiple tokens in a single message to the token manager.

Fault Tolerance

Node failures: state is recovered by other nodes using the failed node's logs.
Communication failures: when the network is partitioned, access is granted only to the group containing a majority of the nodes.
Disk failures: dual-attached RAID provides redundancy.

File Systems: Internet Services vs. HPC

Introduction

Leading Internet services have designed and implemented file systems from scratch to provide high performance for their anticipated application workloads and usage scenarios. Leading examples of such Internet services file systems, as we will call them, include the Google file system (GoogleFS), Amazon Simple Storage Service (S3) and the open-source Hadoop Distributed File System (HDFS).
Another style of computing at a comparable scale and with a growing marketplace [24] is high performance computing (HPC). Like Internet applications, HPC applications are often data-intensive and run in parallel on large clusters (supercomputers). These applications use parallel file systems for highly scalable and concurrent storage I/O. Examples of parallel file systems include IBM's GPFS, Sun's LustreFS, and the open-source Parallel Virtual File System (PVFS).

[2] [4]
Comparison

Experimental Evaluation

Implemented a shim layer that uses Hadoop's extensible abstract file system API (org.apache.hadoop.fs.FileSystem).

