
Page 1: CLOUD Computing FILE      STORAGE SYSTEMS

CLOUD COMPUTING FILE STORAGE SYSTEMS
Sindhuja Venkatesh ([email protected])
21 Oct 2011

University at Buffalo

CSE 726 Hot Topics in Cloud Computing

Page 2: CLOUD Computing FILE      STORAGE SYSTEMS

Overview
Google File System (GFS)
IBM General Parallel File System (GPFS)
Comparisons

Page 3: CLOUD Computing FILE      STORAGE SYSTEMS

Google File System: Introduction
Component failures are the norm.
Files are huge by traditional standards.
Files are modified mostly by appending.
Applications and the file system API are co-designed.

[3]

Page 4: CLOUD Computing FILE      STORAGE SYSTEMS

Design Overview
The system is built from inexpensive components that fail often.
The system stores a modest number of large files.
Workloads consist of two kinds of reads (large streaming and small random), large sequential writes, and concurrent appends that must be supported efficiently.
High sustained bandwidth is the order of the day, as opposed to low latency.

Page 5: CLOUD Computing FILE      STORAGE SYSTEMS

Architecture [3] [5]

Page 6: CLOUD Computing FILE      STORAGE SYSTEMS

Architecture (contd.)
The client translates the file name and byte offset into a chunk index and sends a request to the master.
The master replies with the chunk handle and the locations of its replicas; the client caches this information.
The client then sends a request to a nearby replica, specifying the chunk handle and a byte range.
Requests to the master are typically buffered.
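To make the read flow concrete, here is a minimal Java sketch of the client-side steps described above; Master, ChunkServer, and ChunkLocation are hypothetical stand-ins, not the actual GFS client API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of the GFS read path described above.
// Master, ChunkServer, and ChunkLocation are hypothetical stand-ins.
public class GfsReadSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB chunks

    record ChunkLocation(String chunkHandle, List<String> replicaAddresses) {}

    interface Master {
        // Returns the chunk handle and replica locations for (file, chunkIndex).
        ChunkLocation lookup(String fileName, long chunkIndex);
    }

    interface ChunkServer {
        // Reads 'length' bytes starting at 'offset' within the chunk.
        byte[] read(String chunkHandle, long offset, int length);
    }

    private final Master master;
    private final Map<String, ChunkServer> servers;
    // Client-side cache of chunk locations, keyed by "file:chunkIndex".
    private final Map<String, ChunkLocation> locationCache = new HashMap<>();

    GfsReadSketch(Master master, Map<String, ChunkServer> servers) {
        this.master = master;
        this.servers = servers;
    }

    byte[] read(String fileName, long byteOffset, int length) {
        // 1. Translate (file name, byte offset) into a chunk index.
        long chunkIndex = byteOffset / CHUNK_SIZE;
        long offsetInChunk = byteOffset % CHUNK_SIZE;

        // 2. Ask the master (or the local cache) for the chunk handle and replicas.
        String cacheKey = fileName + ":" + chunkIndex;
        ChunkLocation loc = locationCache.computeIfAbsent(
                cacheKey, k -> master.lookup(fileName, chunkIndex));

        // 3. Read from a nearby replica, specifying chunk handle and byte range.
        ChunkServer replica = servers.get(loc.replicaAddresses().get(0));
        return replica.read(loc.chunkHandle(), offsetInChunk, length);
    }
}
```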

Page 7: CLOUD Computing FILE      STORAGE SYSTEMS

Chunk Size
The chunk size is chosen to be 64 MB.
Advantages of a large chunk size:
• Less interaction between clients and the master
• Reduced network overhead
• Smaller metadata stored at the master
Disadvantages:
• Small files that fit in a single chunk can become hot spots; higher replication of such chunks is used as a solution

Page 8: CLOUD Computing FILE      STORAGE SYSTEMS

Metadata
Three major types:
• File and chunk namespaces
• Mapping from files to chunks
• Locations of chunk replicas
All metadata is stored in the master's memory:
• In-memory data structures
• Chunk locations
• Operation log

Page 9: CLOUD Computing FILE      STORAGE SYSTEMS

Consistency Model – Read
Consider a set of data modifications and a set of reads, all executed by different clients, and assume that the reads are executed a "sufficient" time after the writes.
• Consistent: all clients see the same thing.
• Defined: all clients see the modification in its entirety (atomic).

Page 10: CLOUD Computing FILE      STORAGE SYSTEMS

Lease and Mutation Order - Write

1. The client asks the master which chunkserver holds the lease for the chunk (the primary) and for the locations of all replicas.

2. Master replies. Client caches.

3. Client pre-pushes data to all replicas.

4. After all replicas acknowledge, client sends write request to primary.

5. Primary forwards write request to all replicas.

6. The secondaries signal completion to the primary.

7. Primary replies to client. Errors handled by retrying.
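A compact Java sketch of this push-then-commit flow, assuming hypothetical Replica and Primary classes (the real protocol also assigns serial numbers to concurrent mutations at the primary and propagates errors back to the client; those details are omitted here):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of the GFS write flow above: data is first pushed to all
// replicas (steps 3-4), then a small write request goes to the primary, which
// applies the mutation and forwards the same order to the secondaries
// (steps 5-7). All classes here are hypothetical stand-ins, not real GFS code.
public class GfsWriteSketch {

    // A chunk replica: buffers pushed data, then applies it on command.
    static class Replica {
        private final Map<String, byte[]> buffered = new HashMap<>();
        final List<byte[]> chunkContents = new ArrayList<>(); // applied mutations

        void pushData(String dataId, byte[] data) {   // steps 3-4: buffer the data
            buffered.put(dataId, data);
        }

        void applyMutation(String dataId) {           // steps 5-6: write it
            chunkContents.add(buffered.remove(dataId));
        }
    }

    // The primary additionally decides the mutation order and forwards it.
    static class Primary extends Replica {
        boolean write(String dataId, List<Replica> secondaries) {  // step 5
            applyMutation(dataId);                    // apply locally
            for (Replica s : secondaries) {
                s.applyMutation(dataId);              // forward to secondaries
            }
            return true;                              // step 7: reply (errors -> client retries)
        }
    }

    public static void main(String[] args) {
        Primary primary = new Primary();
        List<Replica> secondaries = List.of(new Replica(), new Replica());

        byte[] data = "hello".getBytes();
        String dataId = "op-1";

        // Step 3: the client pushes the data to all replicas.
        primary.pushData(dataId, data);
        secondaries.forEach(r -> r.pushData(dataId, data));

        // Step 4: after all replicas acknowledge, send the write to the primary.
        boolean ok = primary.write(dataId, secondaries);
        System.out.println("write acknowledged: " + ok);
    }
}
```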

Page 11: CLOUD Computing FILE      STORAGE SYSTEMS

Atomic Record Appends
Similar to the lease-and-mutation-order write path described above.
The client pushes the data to all replicas and then sends the append request to the primary.
The primary:
• pads the current chunk if the record does not fit, telling the client to retry on the next chunk;
• otherwise writes the data and tells the secondaries to write it at the same offset.
Failures may cause a record to be duplicated; duplicates are handled by the client. Data may be different at each replica.
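A toy Java sketch of the primary's append decision, padding the chunk when a record does not fit (hypothetical code; the real primary also forwards the pad and the record to the secondaries and handles replica failures):

```java
// Toy sketch of the primary's record-append decision: if the record does not
// fit in the current chunk, the chunk is padded and the client is told to
// retry on the next chunk; otherwise the record is appended at an offset
// chosen by the primary. Hypothetical code, not the real GFS implementation.
public class RecordAppendSketch {
    static final int CHUNK_SIZE = 64 * 1024 * 1024; // 64 MB

    // Result of an append attempt: either the offset, or "retry on next chunk".
    record AppendResult(boolean retryOnNextChunk, long offset) {}

    private long usedInCurrentChunk = 0;

    AppendResult append(byte[] record) {
        if (usedInCurrentChunk + record.length > CHUNK_SIZE) {
            // Pad the rest of the chunk and ask the client to retry; the same
            // padding would also be applied at the secondaries.
            usedInCurrentChunk = 0;               // conceptually: move to the next chunk
            return new AppendResult(true, -1);
        }
        long offset = usedInCurrentChunk;
        usedInCurrentChunk += record.length;      // write locally and at the secondaries
        return new AppendResult(false, offset);
    }
}
```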

Page 12: CLOUD Computing FILE      STORAGE SYSTEMS

Snapshot
A copy of a file or a directory tree at an instant; used for checkpointing. Handled using copy-on-write:
• First revoke all outstanding leases.
• Then duplicate the metadata, but point to the same chunks.
• When a client later requests a write, the master allocates a new chunk handle and the chunk is copied.
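A small Java sketch of the copy-on-write bookkeeping at the master, assuming a hypothetical one-chunk-per-file model (lease revocation and lazy chunk copying are only hinted at in comments):

```java
import java.util.HashMap;
import java.util.Map;

// Small sketch of snapshot-by-copy-on-write at the master: the snapshot only
// duplicates metadata (file -> chunk handle), and a chunk is actually copied
// the first time someone writes to it after the snapshot.
// Hypothetical code, not the real GFS master.
public class SnapshotSketch {
    // file name -> chunk handle (one chunk per file, to keep the sketch small)
    private final Map<String, String> fileToChunk = new HashMap<>();
    // chunk handle -> how many files reference it
    private final Map<String, Integer> refCount = new HashMap<>();
    private int nextHandle = 0;

    String create(String file) {
        String handle = "chunk-" + nextHandle++;
        fileToChunk.put(file, handle);
        refCount.put(handle, 1);
        return handle;
    }

    // Snapshot: (after revoking leases) duplicate metadata, point to the same chunk.
    void snapshot(String file, String snapshotName) {
        String handle = fileToChunk.get(file);
        fileToChunk.put(snapshotName, handle);
        refCount.merge(handle, 1, Integer::sum);
    }

    // Write: if the chunk is shared, allocate a new handle (copy-on-write).
    String writeTo(String file) {
        String handle = fileToChunk.get(file);
        if (refCount.get(handle) > 1) {
            refCount.merge(handle, -1, Integer::sum);
            String copy = "chunk-" + nextHandle++;  // new chunk handle; data copied lazily
            fileToChunk.put(file, copy);
            refCount.put(copy, 1);
            return copy;
        }
        return handle;
    }
}
```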

Page 13: CLOUD Computing FILE      STORAGE SYSTEMS

Master Operation
• Namespace management and locking
• Replica placement
• Creation, re-replication, rebalancing
• Garbage collection
• Stale replica detection

Page 14: CLOUD Computing FILE      STORAGE SYSTEMS

Fault Tolerance
High availability:
• Fast recovery
• Chunk replication
• Master replication
Data integrity

Page 15: CLOUD Computing FILE      STORAGE SYSTEMS

General Parallel File System: Introduction
The file system was fundamentally designed for high-performance computing clusters.
Traditional supercomputing file access involves:
• Parallel access from multiple nodes within a single file
• Inter-file parallel access (files in the same directory)
GPFS supports fully parallel access to both file data and metadata. Even administrative actions are performed in parallel.

[1]

Page 16: CLOUD Computing FILE      STORAGE SYSTEMS

GPFS Architecture
Achieves extreme scalability through a shared-disk architecture.
File system nodes:
• Cluster nodes on which the file system and the applications that use it run
• Equal access to all disks
Switching fabric:
• Storage area network (SAN)
Shared disks:
• Files are striped across all the file system's disks.

Page 17: CLOUD Computing FILE      STORAGE SYSTEMS

GPFS Issues
Data striping and allocation, prefetch and write-behind:
• Large files are divided into equal-sized blocks, and consecutive blocks are placed on different disks.
• 256 KB block size.
• Data is prefetched into the buffer pool.
Large directory support:
• Extensible hashing is used for file name lookup in large directories.
Logging and recovery:
• All metadata updates are logged; every node keeps a log for each file system it mounts.

Page 18: CLOUD Computing FILE      STORAGE SYSTEMS

Distributed locking vs. Centralized Management

Distributed locking: every file system operation acquires an appropriate read or write lock to synchronize with conflicting operations on other nodes before reading or updating any file system data or metadata.

Centralized management: all conflicting operations are forwarded to a designated node, which performs the requested read or update.

Page 19: CLOUD Computing FILE      STORAGE SYSTEMS

Distributed Lock Manager
GPFS uses a centralized global lock manager in conjunction with local lock managers on each file system node.
The global lock manager coordinates locks between local lock managers by handing out lock tokens.
Repeated accesses to the same disk object from the same node require only a single message to obtain the right to acquire a lock on the object (the lock token).
Only when an operation on another node requires a conflicting lock on the same object are additional messages necessary to revoke the lock token from the first node so that it can be granted to the other node.
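A conceptual Java sketch of token-based locking with a hypothetical global token manager and per-node local lock managers (read/write lock modes and the real revoke message flow are simplified to exclusive tokens):

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch of GPFS-style lock tokens: a node that already holds a
// token for an object can grant local locks without contacting the global
// token manager; only conflicting access from another node forces a revoke.
// Lock modes (read vs. write) are ignored; all names are hypothetical.
public class TokenLockSketch {
    enum Mode { READ, WRITE }

    // Global token manager: tracks which node holds the token for each object.
    static class GlobalTokenManager {
        private final Map<String, String> tokenHolder = new HashMap<>();

        synchronized String acquireToken(String objectId, String nodeId) {
            String holder = tokenHolder.get(objectId);
            if (holder != null && !holder.equals(nodeId)) {
                // Conflicting holder: its token must be revoked first.
                System.out.println("revoke token for " + objectId + " from " + holder);
            }
            tokenHolder.put(objectId, nodeId);
            return objectId; // the token
        }
    }

    // Local lock manager: caches tokens so repeated local accesses are cheap.
    static class LocalLockManager {
        private final String nodeId;
        private final GlobalTokenManager global;
        private final Map<String, String> cachedTokens = new HashMap<>();

        LocalLockManager(String nodeId, GlobalTokenManager global) {
            this.nodeId = nodeId;
            this.global = global;
        }

        void lock(String objectId, Mode mode) {
            // Only the first access to an object needs a message to the global
            // token manager; later accesses hit the local token cache.
            cachedTokens.computeIfAbsent(objectId,
                    id -> global.acquireToken(id, nodeId));
            // ... grant the read/write lock locally ...
        }
    }

    public static void main(String[] args) {
        GlobalTokenManager global = new GlobalTokenManager();
        LocalLockManager nodeA = new LocalLockManager("nodeA", global);
        LocalLockManager nodeB = new LocalLockManager("nodeB", global);
        nodeA.lock("inode-42", Mode.WRITE); // one message to the token manager
        nodeA.lock("inode-42", Mode.WRITE); // served from the local token cache
        nodeB.lock("inode-42", Mode.WRITE); // forces a revoke from nodeA
    }
}
```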

Page 20: CLOUD Computing FILE      STORAGE SYSTEMS

Parallel Data Access
Certain classes of supercomputer applications require writing to the same file from multiple nodes.
GPFS uses byte-range locking to synchronize reads and writes to file data:
• The first writer is given a byte-range token covering the whole file (zero to infinity).
• The token is then narrowed as other nodes concurrently read or write the file.
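A toy Java illustration of byte-range token narrowing under the stated assumptions (a single outstanding write token, non-overlapping requests, no lock modes); the real GPFS negotiation considers the access patterns of all nodes:

```java
// Toy illustration of byte-range tokens: the first writer gets [0, +inf),
// and the range is narrowed when a second node wants a conflicting range.
// This ignores modes, multiple holders, and the real negotiation protocol.
public class ByteRangeTokenSketch {
    static class Token {
        String holder;
        long start;
        long end; // Long.MAX_VALUE stands in for "infinity"

        Token(String holder, long start, long end) {
            this.holder = holder;
            this.start = start;
            this.end = end;
        }
    }

    private Token token; // single outstanding write token in this toy model

    // Request a write token for [start, end) on behalf of 'node'.
    Token requestWriteToken(String node, long start, long end) {
        if (token == null) {
            // First writer gets the whole file: [0, infinity).
            token = new Token(node, 0, Long.MAX_VALUE);
        } else if (!token.holder.equals(node)) {
            // Narrow the existing holder's range so it no longer overlaps,
            // then hand the requested range to the new writer.
            token.end = Math.min(token.end, start);
            return new Token(node, start, end);
        }
        return token;
    }

    public static void main(String[] args) {
        ByteRangeTokenSketch mgr = new ByteRangeTokenSketch();
        Token a = mgr.requestWriteToken("nodeA", 0, 1 << 20);       // gets [0, inf)
        Token b = mgr.requestWriteToken("nodeB", 1 << 20, 2 << 20); // nodeA is narrowed
        System.out.println("A: [" + a.start + "," + a.end + ") B: [" + b.start + "," + b.end + ")");
    }
}
```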

Page 21: CLOUD Computing FILE      STORAGE SYSTEMS

Parallel Data Access (measurements)
The measurements demonstrate how I/O throughput in GPFS scales when adding more file system nodes and more disks to the system.

The figure compares reading and writing a single large file from multiple nodes in parallel against each node reading or writing a different file.

At 18 nodes the write throughput leveled off due to a problem in the switch adapter microcode.

The other point to note in this figure is that writing to a single file from multiple nodes in GPFS was just as fast as each node writing to a different file, demonstrating the effectiveness of the byte-range token protocol described before.

Page 22: CLOUD Computing FILE      STORAGE SYSTEMS

Synchronizing Access to Metadata
Like other file systems, GPFS uses inodes and indirect blocks to store file attributes and data block addresses.
Write operations in GPFS use a shared write lock on the inode that allows concurrent writers on multiple nodes.
One of the nodes accessing the file is designated as the metanode for the file; only the metanode reads or writes the inode from or to disk.

Each writer updates a locally cached copy of the inode and forwards its inode updates to the metanode periodically or when the shared write token is revoked by a stat() or read() operation on another node.

The metanode for a particular file is elected dynamically with the help of the token server. When a node first accesses a file, it tries to acquire the metanode token for the file. The token is granted to the first node to do so; other nodes instead learn the identity of the metanode.

Page 23: CLOUD Computing FILE      STORAGE SYSTEMS

Allocation Maps
The allocation map records the allocation status (free or in-use) of all disk blocks in the file system.

Since each disk block can be divided into up to 32 subblocks to store data for small files, the allocation map contains 32 bits per disk block as well as linked lists for finding a free disk block or a subblock of a particular size efficiently.

For each GPFS file system, one of the nodes in the cluster is responsible for maintaining free space statistics about all allocation regions. This allocation manager node initializes free space statistics by reading the allocation map when the file system is mounted.
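A simplified Java sketch of an allocation map with 32 subblocks per disk block (hypothetical layout; the real GPFS map is additionally split into regions so that different nodes can allocate from different regions without lock conflicts):

```java
import java.util.BitSet;

// Simplified sketch of an allocation map: each disk block has 32 subblocks,
// and one bit per subblock records free (clear) vs. in-use (set). Small files
// can occupy individual subblocks; large files take whole blocks.
// Class and method names are hypothetical.
public class AllocationMapSketch {
    static final int SUBBLOCKS_PER_BLOCK = 32;

    private final BitSet map;   // one bit per subblock
    private final int numBlocks;

    AllocationMapSketch(int numBlocks) {
        this.numBlocks = numBlocks;
        this.map = new BitSet(numBlocks * SUBBLOCKS_PER_BLOCK);
    }

    // Allocate a whole block: all 32 of its subblocks must be free.
    int allocateFullBlock() {
        for (int b = 0; b < numBlocks; b++) {
            int from = b * SUBBLOCKS_PER_BLOCK;
            if (map.get(from, from + SUBBLOCKS_PER_BLOCK).isEmpty()) {
                map.set(from, from + SUBBLOCKS_PER_BLOCK);
                return b;
            }
        }
        return -1; // no free block
    }

    // Allocate 'n' subblocks for a small file, possibly inside a partially
    // used block. (A real implementation would keep free lists per size
    // instead of scanning, and would check that the run fits in one block.)
    int allocateSubblocks(int n) {
        int start = map.nextClearBit(0);
        map.set(start, start + n);
        return start;
    }
}
```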

Page 24: CLOUD Computing FILE      STORAGE SYSTEMS

Token Manager Scaling
The token manager keeps track of all lock tokens granted to all nodes in the cluster.
GPFS uses a number of optimizations in the token protocol that significantly reduce the cost of token management and improve response time as well.

When it is necessary to revoke a token, it is the responsibility of the revoking node to send revoke messages to all nodes that are holding the token in a conflicting mode, to collect replies from these nodes, and to forward these as a single message to the token manager.

Acquiring a token will never require more than two messages to the token manager, regardless of how many nodes may be holding the token in a conflicting mode.

The protocol also supports token prefetch and token request batching, which allow acquiring multiple tokens in a single message to the token manager.

Page 25: CLOUD Computing FILE      STORAGE SYSTEMS

Fault Tolerance
Node failures: logs are kept on shared disks, so another node can replay the failed node's log and restore metadata consistency.
Communication failures: when the network is partitioned, file system access is granted only to the group containing a majority of the nodes.
Disk failures: dual-attached RAID for redundancy.

Page 26: CLOUD Computing FILE      STORAGE SYSTEMS

File Systems: Internet Services Vs. HPC

Introduction
Leading Internet services have designed and implemented file systems "from scratch" to provide high performance for their anticipated application workloads and usage scenarios.
Leading examples of such Internet services file systems, as we will call them, include the Google file system (GoogleFS), Amazon Simple Storage Service (S3), and the open-source Hadoop distributed file system (HDFS).
Another style of computing at a comparable scale, and with a growing marketplace, is high-performance computing (HPC). Like Internet applications, HPC applications are often data-intensive and run in parallel on large clusters (supercomputers). These applications use parallel file systems for highly scalable and concurrent storage I/O.
Examples of parallel file systems include IBM's GPFS, Sun's LustreFS, and the open-source Parallel Virtual File System (PVFS).

[2] [4]

Page 27: CLOUD Computing FILE      STORAGE SYSTEMS

Comparison

Page 28: CLOUD Computing FILE      STORAGE SYSTEMS

Experimental Evaluation
A shim layer was implemented that uses Hadoop's extensible abstract file system API (org.apache.hadoop.fs.FileSystem) to use PVFS for all file I/O operations.

Hadoop directs all file system operations to the shim layer that forwards each request to the PVFS user-level library. This implementation does not make any code changes to PVFS other than one configuration change, increasing the default 64KB stripe size to match the HDFS chunk size of 64MB, during PVFS setup.
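A conceptual Java sketch of the forwarding idea behind the shim layer; the real shim subclasses org.apache.hadoop.fs.FileSystem, but here both the Hadoop-facing interface and the PVFS client library are hypothetical stand-ins so the example stays self-contained:

```java
import java.io.IOException;

// Conceptual sketch of the shim-layer idea: Hadoop-style file system calls
// are forwarded one-for-one to a PVFS client library, so no changes to the
// PVFS servers are required. Both interfaces below are hypothetical.
public class PvfsShimSketch {
    // Hypothetical subset of the file system API that Hadoop calls into.
    interface HadoopStyleFs {
        byte[] read(String path, long offset, int length) throws IOException;
        void write(String path, long offset, byte[] data) throws IOException;
        long getFileLength(String path) throws IOException;
    }

    // Hypothetical user-level PVFS client library.
    interface PvfsClient {
        byte[] pvfsRead(String path, long offset, int length) throws IOException;
        void pvfsWrite(String path, long offset, byte[] data) throws IOException;
        long pvfsStatSize(String path) throws IOException;
    }

    // The shim: every call is forwarded directly to PVFS.
    static class PvfsShim implements HadoopStyleFs {
        private final PvfsClient pvfs;

        PvfsShim(PvfsClient pvfs) {
            this.pvfs = pvfs;
        }

        @Override
        public byte[] read(String path, long offset, int length) throws IOException {
            return pvfs.pvfsRead(path, offset, length);
        }

        @Override
        public void write(String path, long offset, byte[] data) throws IOException {
            pvfs.pvfsWrite(path, offset, data);
        }

        @Override
        public long getFileLength(String path) throws IOException {
            return pvfs.pvfsStatSize(path);
        }
    }
}
```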

Page 29: CLOUD Computing FILE      STORAGE SYSTEMS

Experimental Evaluation Contd.

Page 30: CLOUD Computing FILE      STORAGE SYSTEMS

Experiment (contd.)
The shim layer has three key components that are used by Hadoop applications.

Readahead buffering – While applications can be programmed to request data in any size, the Hadoop framework uses 4KB as the default amount of data accessed in each file system call. Instead of performing many such small reads, HDFS prefetches the entire chunk (of default size 64MB); the shim layer emulates this for PVFS by reading a large region into a buffer and serving the small reads from it (see the sketch after this list).

Data layout module – The Hadoop/Mapreduce job scheduler distributes computation tasks across many nodes in the cluster. Although not mandatory, it prefers to assign tasks to those nodes that store input data required for that task. This requires the Hadoop job scheduler to be aware of the file’s layout information. Fortunately, as a parallel file system, PVFS has this information at the client, and exposes the file striping layout as an extended attribute of each file. Our shim layer matches the HDFS API for the data layout by querying the appropriate extended attributes as needed.

Replication emulator – Although the public release of PVFS does not support triplication, our shim enables PVFS to emulate HDFS-style replication by writing, on behalf of the client, to three data servers with every application write. Note that it is the client that sends the three write requests to different servers, unlike HDFS, which uses pipelining among its servers. Our approach was motivated by the simplicity of emulating replication at the client instead of making non-trivial changes to the PVFS server implementation. Planned work in the PVFS project includes support for replication techniques.
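A toy Java sketch of the readahead buffering component: small reads are served from a large buffered region that is fetched once from the underlying store (the UnderlyingStore interface is a hypothetical stand-in for the PVFS client library; end-of-file and concurrency handling are omitted):

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of readahead buffering: Hadoop issues small (e.g. 4 KB) reads,
// so the layer fetches a large region (64 MB here) once and serves the small
// reads from that buffer. UnderlyingStore is a hypothetical stand-in.
public class ReadaheadBufferSketch {
    static final int BUFFER_SIZE = 64 * 1024 * 1024; // 64 MB readahead region

    interface UnderlyingStore {
        // Reads up to 'length' bytes of 'path' starting at 'offset'.
        byte[] read(String path, long offset, int length);
    }

    private final UnderlyingStore store;
    private final Map<String, byte[]> buffers = new HashMap<>();   // path -> buffer
    private final Map<String, Long> bufferStart = new HashMap<>(); // path -> buffer offset

    ReadaheadBufferSketch(UnderlyingStore store) {
        this.store = store;
    }

    // Serve a small read, refilling the 64 MB buffer only when the requested
    // range falls outside the currently buffered region.
    // (End-of-file handling is omitted in this sketch.)
    byte[] read(String path, long offset, int length) {
        byte[] buf = buffers.get(path);
        Long start = bufferStart.get(path);
        boolean miss = buf == null || offset < start
                || offset + length > start + buf.length;
        if (miss) {
            long alignedStart = (offset / BUFFER_SIZE) * BUFFER_SIZE;
            buf = store.read(path, alignedStart, BUFFER_SIZE); // one big read
            buffers.put(path, buf);
            bufferStart.put(path, alignedStart);
            start = alignedStart;
        }
        int from = (int) (offset - start);
        byte[] out = new byte[length];
        System.arraycopy(buf, from, out, 0, length);
        return out;
    }
}
```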

Page 31: CLOUD Computing FILE      STORAGE SYSTEMS

Experimental Setup
Experiments were performed on two clusters.
A small cluster for microbenchmarks (the SS cluster): 20 nodes, each with a dual-core 3 GHz Pentium D processor, 4 GB of memory, and one 7200 RPM 180 GB Seagate Barracuda SATA disk with an 8 MB DRAM buffer. Nodes are directly connected to an HP ProCurve 2848 over a Gigabit Ethernet backplane, with 100 microsecond node-to-node latency. All machines run the Linux 2.6.24.2 kernel (Debian release) and use the ext3 file system to manage their disks.
A big cluster for running real applications: for large-scale testing, the Yahoo! M45 cluster was used, a 4000-core cluster for experimenting with ideas in data-intensive scalable computing. It makes available about 400 nodes, of which typically 50-100 are used at a time, each with two quad-core 1.86 GHz Xeon processors, 6 GB of memory, and four 7200 RPM 750 GB Seagate Barracuda ES SATA disks with 8 MB DRAM buffers. Because of the configuration of these nodes, only one disk is used for a PVFS I/O server. Nodes are interconnected using a Gigabit Ethernet switch hierarchy. All machines run Red Hat Enterprise Linux Server (release 5.1) with the 2.6.18-53.1.13.el5 kernel and use the ext3 file system to manage their disks.

Page 32: CLOUD Computing FILE      STORAGE SYSTEMS

Results

Page 33: CLOUD Computing FILE      STORAGE SYSTEMS

Results – Micro Benchmarks

Page 34: CLOUD Computing FILE      STORAGE SYSTEMS

Results – Micro Benchmarks Contd.

Page 35: CLOUD Computing FILE      STORAGE SYSTEMS

Results – Micro Benchmarks Contd.

Page 36: CLOUD Computing FILE      STORAGE SYSTEMS

Performance of Real Applications

Page 37: CLOUD Computing FILE      STORAGE SYSTEMS

Conclusion and Future Work
This paper explores the relationship between modern parallel file systems, represented by PVFS, and purpose-built Internet services file systems, represented by HDFS, in the context of their design and performance. It is shown that PVFS can perform comparably to HDFS in the Hadoop Internet services stack.

The biggest difference between PVFS and HDFS is the redundancy scheme for handling failures.

On balance, it is believed that parallel file systems could be made available for use in Hadoop while delivering promising performance for diverse access patterns. These services can benefit from parallel file system specializations for concurrent writing and for faster metadata and small-file operations. With a range of parallel file systems to choose from, Internet services can select a system that better integrates with their local data management tools.

In future work, the "opposite" direction can be investigated: how Internet services file systems could be used for HPC applications.

Page 38: CLOUD Computing FILE      STORAGE SYSTEMS

References
[1] Frank Schmuck and Roger Haskin. "GPFS: A Shared-Disk File System for Large Computing Clusters." Proceedings of the Conference on File and Storage Technologies (FAST '02), Monterey, CA, 28-30 January 2002, pp. 231-244. USENIX, Berkeley, CA. IBM Almaden Research Center, San Jose, CA.
[2] Wittawat Tantisiriroj, Swapnil Patil, and Garth Gibson. "Data-intensive file systems for Internet services: A rose by any other name ..." Technical Report CMU-PDL-08-114, Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA, October 2008. URL: http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/hdfspvfs-tr-08.pdf
[3] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. "The Google File System." 19th ACM Symposium on Operating Systems Principles (SOSP), Lake George, NY, October 2003.
[4] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin (Yale University, Brown University). "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads." Proceedings of the VLDB Endowment, Volume 2, Issue 1, August 2009.
[5] Jimmy Lin and Chris Dyer. "Data-Intensive Text Processing with MapReduce." University of Maryland, College Park. URL: http://www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf

Page 39: CLOUD Computing FILE      STORAGE SYSTEMS

THANK YOU!!