Large Scale Sharing
The Google File System
PAST: Storage Management & Caching
– Presented by Chi H. Ho
Introduction
A next step from network file systems. How large?
GFS:
• > 1,000 storage nodes
• > 300 TB of disk storage
• Hundreds of client machines
PAST: Internet-scale
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Goals
Performance, Scalability, Reliability, Availability.
Highly tuned for:
• Google's back-end file service
• Workloads: multiple-producer/single-consumer, many-way merging
Assumptions
• H/W: inexpensive components that often fail.
• Files: a modest number of large files.
• Reads/Writes: 2 kinds:
  – Large streaming: the common case => optimized.
  – Small random: supported, but need not be efficient.
• Concurrency: hundreds of concurrent appends.
• Performance: high sustained bandwidth is more important than low latency.
Interface
Usual operations: create, delete, open, close, read, and write.
GFS-specific:
• snapshot: creates a copy of a file or a directory tree at low cost.
• record append: lets many clients append to the same file concurrently and atomically.
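As a rough sketch only, the client-facing surface just listed might look like the Python stub below. The class and method names are hypothetical illustrations; GFS's actual client library is not public.

```python
# Hypothetical sketch of the client operations listed above;
# names and signatures are illustrative, not GFS's actual API.

class GFSClient:
    # Usual operations
    def create(self, path): ...
    def delete(self, path): ...
    def open(self, path): ...
    def close(self, handle): ...
    def read(self, handle, offset, length): ...
    def write(self, handle, offset, data): ...

    # GFS-specific operations
    def snapshot(self, src, dst):
        """Copy a file or directory tree at low cost (copy-on-write)."""

    def record_append(self, handle, record):
        """Append atomically even under concurrency; returns the
        offset that GFS chose for the record."""
```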
Architecture
[Figure: GFS architecture. The master, the chunkservers, and the clients all run as user-level processes.]
Architecture (Files)
Files are divided into fixed-size chunks, each replicated at multiple chunkservers (default 3) and stored as a Linux file.
Each chunk is identified by an immutable and globally unique chunk handle assigned by the master at the time of chunk creation.
Read/Write data chunk specified by <chunk handle, byte range>
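A minimal sketch of the read path this implies, assuming a hypothetical master RPC find_chunk(path, chunk_index) -> (handle, replica_addrs):

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB chunks (see "Chunk Size" below)

def read_range(master, connect, path, offset, length):
    """Client-side read sketch: translate a byte offset into a chunk
    index, get the handle and replica locations from the master, then
    fetch the byte range directly from one chunkserver replica."""
    chunk_index = offset // CHUNK_SIZE
    handle, replicas = master.find_chunk(path, chunk_index)  # hypothetical RPC
    start = offset % CHUNK_SIZE
    n = min(length, CHUNK_SIZE - start)   # stay within this chunk
    chunkserver = connect(replicas[0])    # GFS would prefer a nearby replica
    return chunkserver.read(handle, start, n)
```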
Architecture (Master)
Maintains metadata:
• Namespace
• Access control
• Mapping from files to chunks
• Chunks' locations
Controls system-wide activities:
• Chunk lease management
• Garbage collection
• Chunk migration
• Heartbeat messages to chunkservers
Architecture (Client)
• Interacts with the master for metadata.
• Communicates directly with chunkservers for data.
Architecture (Notes)
No data cache is needed: Why?
• Client: ???
• Chunkservers: ???
Architecture (Notes)
No data cache is needed: Why?
• Client: most applications stream through huge files or have working sets too large to be cached.
• Chunkservers: already have Linux cache.
Single Master
Bottleneck?
Single point of failure?
Single Master
Bottleneck? No: data never flows through the master; clients only ask it for chunk locations, prefetch locations for multiple chunks per request, and cache them.
Single point of failure? No: the master's state is replicated on multiple machines, mutations of that state are atomic, and "shadow" masters temporarily serve reads.
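A sketch of why the master stays off the critical path: clients cache chunk locations and fetch several consecutive entries per master round trip. The batched RPC and the prefetch count here are assumptions for illustration, not the paper's exact protocol.

```python
class ChunkLocationCache:
    """Client-side cache of (path, chunk_index) -> (handle, replicas).
    One master request prefetches locations for several chunks, so
    most reads never contact the master at all."""

    PREFETCH = 4  # hypothetical: entries fetched per master round trip

    def __init__(self, master):
        self.master = master
        self.entries = {}

    def locate(self, path, chunk_index):
        key = (path, chunk_index)
        if key not in self.entries:
            # Hypothetical batched RPC yielding (index, handle, replicas).
            for idx, handle, replicas in self.master.find_chunks(
                    path, chunk_index, count=self.PREFETCH):
                self.entries[(path, idx)] = (handle, replicas)
        return self.entries[key]
```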
Chunk Size
Large: 64 MB. Advantages:
• Reduces client-master interaction.
• Reduces network overhead (persistent TCP connections).
• Reduces the size of metadata => it can be kept in memory (see the arithmetic below).
Disadvantage: small files (with few chunks) may become hot spots.
Solutions: store more replicas of small files; potentially, let clients read from other clients.
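To gauge the metadata point: at 64 MB per chunk, the > 300 TB deployment from the introduction holds on the order of 300 TB / 64 MB ≈ 5 million chunks; the paper reports under 64 bytes of metadata per chunk, so the whole mapping needs only a few hundred MB of master memory.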
Metadata
Three major types, all kept in the master's memory:
• file and chunk namespaces,
• file-to-chunk mapping,
• locations of each chunk's replicas.
Persistence:
• Namespaces and mapping: an operation log stored on multiple machines.
• Chunks' locations: polled when the master starts and when chunkservers join; updated by heartbeat messages.
Operation Log
At the heart of GFS:
• The only persistent record of metadata.
• The logical timeline that orders concurrent operations.
Operations are committed atomically. The master's state is recovered by replaying the operations in the log.
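A minimal sketch of recovery by replay, assuming one JSON record per committed operation; checkpoints, log replication, and the real record format are omitted.

```python
import json

def recover_master_state(log_path):
    """Rebuild the master's in-memory metadata by replaying the
    operation log in commit order (sketch)."""
    state = {"namespace": {}}
    with open(log_path) as log:
        for line in log:
            apply_op(state, json.loads(line))
    return state

def apply_op(state, op):
    """Hypothetical dispatch over logged operation types."""
    if op["type"] == "create_file":
        state["namespace"][op["path"]] = []          # empty chunk list
    elif op["type"] == "add_chunk":
        state["namespace"][op["path"]].append(op["handle"])
    elif op["type"] == "delete_file":
        state["namespace"].pop(op["path"], None)
```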
Consistency
Metadata: solely controlled by the master.
Data: consistent after successful mutations:
• The same order of mutations is applied on all replicas.
• Stale replicas (missing some mutations) are detected and eliminated.
Defined: consistent, and clients see what the mutation writes in its entirety.
Consistent: all clients see the same data, regardless of which replica they read from.
Leases and Mutation Order
Lease: high-level chunk-based access control mechanism, granted by the master.
Global mutation order = lease grant order + serial number within a lease, chosen by the primary (lease holder).
Illustration of a mutation
1. The client asks the master for the lease holder of the chunk and the locations of the primary and secondary replicas.
2. The master locates the lease (or grants one if none exists) and replies; the client caches the locations.
3. The client pushes the data to all replicas; each stores it in an LRU buffer and acks.
4. After waiting for all replicas to ack, the client sends the write request to the primary.
5. The primary assigns a serial number to the request and forwards the write request to the secondaries.
6. The secondaries report the request completed.
7. The primary replies to the client (possibly with errors).
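The seven steps condensed into a sketch; the RPC names are invented for illustration, and error handling is reduced to returning the primary's reply.

```python
def mutate_chunk(master, cache, path, chunk_index, data):
    """Sketch of one mutation following the numbered steps above."""
    # Steps 1-2: find the lease holder (primary) and the secondaries,
    # consulting the client-side cache first; cache the answer.
    key = (path, chunk_index)
    if key not in cache:
        cache[key] = master.find_lease_holder(path, chunk_index)
    handle, primary, secondaries = cache[key]

    # Step 3: push the data to every replica; each buffers it (LRU)
    # and acks. Data flow is decoupled from control flow.
    acks = [r.push_data(handle, data) for r in [primary] + secondaries]

    # Step 4: once all replicas have acked, send the write request to
    # the primary. Steps 5-6: the primary assigns a serial number,
    # forwards the request to the secondaries, and collects their
    # completions. Step 7: the primary's reply (possibly with errors).
    assert all(acks)
    return primary.write(handle, secondaries=secondaries)
```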
Special Ops Revisited
Atomic Record Appends:
• The offset is chosen by GFS (the primary replica), not the client.
• Upon failure: pad the failed replica(s), then retry.
• Guarantee: the record is appended to the file at least once atomically.
Snapshot:
• Copy-on-write; used to make a copy of a file/directory quickly.
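At-least-once semantics mean a retried append can leave duplicates (and padding) in the file. A common remedy, sketched here as an assumption rather than as GFS's prescribed mechanism, is for writers to tag records with unique IDs and for readers to filter duplicates:

```python
import uuid

def append_with_retry(gfs, handle, payload):
    """Writer side: retry record_append until it succeeds. The record
    may land in the file more than once (at-least-once semantics)."""
    record = {"id": str(uuid.uuid4()), "payload": payload}
    while True:
        ok, offset = gfs.record_append(handle, record)  # GFS picks offset
        if ok:
            return offset

def iter_unique(records):
    """Reader side: suppress duplicates via the embedded record IDs."""
    seen = set()
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            yield rec
```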
Master Operations
• Namespace management and locking: to support concurrent master operations.
• Replica placement: to avoid dependent failures and exploit network bandwidth.
• Creation, re-replication, rebalancing: for better disk utilization, load balancing, and fault tolerance.
• Garbage collection: lazy deletion is simple, efficient, and supports undelete.
• Stale replica detection: obsolete replicas are detected => garbage collected.
Fault Tolerance Sum Up
Master fails? Chunkservers fail? Disks corrupted? Network noise?
Micro-benchmarks
Configuration: 1 master (with 2 replicas), 16 chunkservers, 16 clients.
Each machine: dual 1.4 GHz PIII, 2 GB memory, 2x 80 GB 5400 rpm disks, full-duplex 100 Mbps NIC.
[Figure: the clients sit on one switch, the master and chunkservers on another; a 1 Gbps link connects the two switches.]
Micro-benchmark Tests and Results
Reads: N clients read simultaneously, at random, from a 320 GB file set; each client reads 1 GB in 4 MB reads.
Writes: N clients write simultaneously to N distinct files; each client writes 1 GB in 1 MB writes.
Record appends: N clients append simultaneously to one file.
Real World Clusters
Cluster A: R&D use by over 100 engineers. Typical task:
• initiated by a human user and runs up to several hours;
• reads MBs to TBs of data, processes it, and writes the results back.
Cluster B: production data processing. Tasks:
• long-lasting; continuously generate and process multi-TB data sets;
• only occasional human intervention.
Real World Measurements
The table shows:
• Sustained high throughput.
• Light workload on the master.
Recovery: a full recovery of one failed chunkserver takes 23.2 minutes; prioritized recovery to a state that tolerates one more failure takes 2 minutes.
Workload Breakdown
Conclusion
The design is narrow: highly specific to Google's applications. Most of the challenges are in implementation: more a development effort than a research one. However, GFS is a complete, deployed solution.
Any opinions/comments?
Storage management and caching in PAST, a large-scale, persistent
peer-to-peer storage utility
Antony Rowstron, Peter Druschel
What is PAST?
• An Internet-based, P2P global storage utility.
• An archival storage and content distribution utility, not a general-purpose file system.
• Nodes form a self-organizing overlay network; nodes may contribute storage.
• Files are inserted and retrieved, identified by a fileID (and optionally protected by a key). Files are immutable.
• PAST does not itself provide a lookup service; it is built on top of one, such as Pastry.
Goals
Strong persistence, High availability, Scalability, Security.
Background – Pastry
A P2P routing substrate. Given (fileID, msg), route msg to the node whose nodeID is numerically closest to fileID.
Routing cost: ceiling(log_{2^b} N) steps.
Eventual delivery is guaranteed unless floor(l/2) nodes with adjacent nodeIDs fail simultaneously.
Per-node state: (2^b - 1) * ceiling(log_{2^b} N) + 2l entries mapping nodeIDs to IP addresses.
Node recovery takes O(log_{2^b} N) messages.
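For example, with b = 4 and N = 2250 (the configuration used in the experiments later in this talk), routing takes at most ceiling(log_16 2250) = 3 steps.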
Pastry – A closer look…
Routing: forward a message with fileID to a node whose nodeID shares more digits with fileID than the current node's does; if no such node is known, forward to a node whose match is equally long but whose nodeID is numerically closer to fileID.
Other nice properties: fault-resilient, self-organizing, scalable, efficient.
[Figure: example routing table and route, b = 2, l = 8.]
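A sketch of that forwarding rule over hex-digit IDs; the per-node state is simplified to a flat set of known nodes rather than Pastry's routing table plus leaf set.

```python
def shared_prefix(a, b):
    """Length of the common leading-digit prefix of two IDs."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(my_id, file_id, known):
    """One Pastry-style routing step (sketch). Prefer a node sharing a
    longer prefix with file_id than we do; otherwise fall back to a
    known node that is numerically closer to file_id."""
    mine = shared_prefix(my_id, file_id)
    better = [n for n in known if shared_prefix(n, file_id) > mine]
    if better:
        return max(better, key=lambda n: shared_prefix(n, file_id))
    dist = lambda n: abs(int(n, 16) - int(file_id, 16))
    closer = [n for n in known if dist(n) < dist(my_id)]
    return min(closer, key=dist) if closer else None  # None: we are closest

# e.g. next_hop("65a1fc", "d46a1c", {"d13da3", "d4213f"}) -> "d4213f"
```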
PAST Operations
Insert:
• fileID := SHA-1(filename, public key, salt) => unique.
• A file certificate is issued; the client's quota is charged.
Lookup:
• Based on fileID; a node returns the file's contents and certificate.
Reclaim:
• The client issues a reclaim certificate for authentication.
• The client's quota is credited, double-checked against a reclaim receipt.
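A sketch of the fileID computation; the paper specifies SHA-1 over the file name, the owner's public key, and a random salt, though the exact serialization below is an assumption.

```python
import hashlib
import os

def make_file_id(filename: str, public_key: bytes):
    """Derive (fileID, salt) as in PAST's insert operation. Retrying
    with a fresh salt yields a different fileID, which is exactly the
    mechanism file diversion relies on (see Storage Management)."""
    salt = os.urandom(20)
    h = hashlib.sha1()
    h.update(filename.encode("utf-8"))
    h.update(public_key)
    h.update(salt)
    return h.digest(), salt  # fileID is 160 bits, like nodeIDs
```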
Security Overview
Each node and each user holds a smartcard. Security model:
• It is infeasible to break the cryptosystems.
• Most nodes are well-behaved.
• Smartcards cannot be controlled by an attacker.
The smartcards generate the certificates and receipts that enforce security: file certificates, reclaim certificates, reclaim receipts, etc.
Storage Management
Assumptions:
• Storage capacities of nodes differ by no more than 2 orders of magnitude.
• Advertised capacity is the basis for admitting nodes.
Two conflicting responsibilities:
• Balance free storage under stress.
• Keep k copies of each file fileID at the k nodes whose nodeIDs are closest to fileID.
I) Load Balancing
What causes load imbalance? Differences in:
• #files per node (due to the distribution of nodeIDs and fileIDs),
• the size distribution of inserted files,
• the storage capacity of nodes.
What does the solution aim for? Blur the differences by redistributing data:
• Replica diversion: local scale (relocate a replica among leaf-set nodes).
• File diversion: global scale (relocate all replicas under a different fileID).
[Flowchart: storage decision at node N receiving file D. Notation: S_D = size of D; F_N, F_N' = free space at N, N'; t_pri = primary threshold; t_div = diversion threshold.]
• If S_D / F_N <= t_pri: N stores D and issues a receipt (D is also forwarded to the other k-1 replica holders).
• Otherwise, attempt replica diversion: choose the diversion node N' = the node in N's leaf set with maximum free storage, among those not holding the fileID in their k-closest set and not already holding a diverted replica.
• If no such N' exists, or S_D / F_N' > t_div: fall back to file diversion.
• Otherwise: D is stored at N'; N points to N', and the (k+1)-st closest node also keeps a pointer.
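The same decision logic as code; leaf_set, the free-space fields, and pointer installation are simplified placeholders (a sketch, with the threshold values from the experiments below).

```python
def on_insert(node, file_size, t_pri=0.1, t_div=0.05):
    """Storage decision at a node N among the k closest to the fileID
    (sketch of the flowchart above)."""
    if file_size / node.free_space <= t_pri:
        node.store(file_size)               # store D, issue receipt
        return "stored"

    # Replica diversion: the leaf with the most free space that is not
    # itself among the k closest and holds no diverted replica yet.
    candidates = [x for x in node.leaf_set
                  if not x.in_k_closest and not x.has_diverted_replica]
    best = max(candidates, key=lambda x: x.free_space, default=None)
    if best is not None and file_size / best.free_space <= t_div:
        best.store(file_size)               # store D at N'; N (and the
        node.diversion_ptr = best           # (k+1)-st node) point to N'
        return "diverted"

    # Otherwise: file diversion -- the client retries the whole insert
    # with a new salt, producing a different fileID.
    return "file_diversion"
```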
II) Maintaining k Replicas
Problem: nodes join and leave.
On joining:
• The new node initially just holds a pointer to the node it replaced (similar to replica diversion).
• Replicas are gradually migrated to it as a background job.
On leaving:
• Each affected node picks a new k-th closest node, updates its leaf set, and forwards replicas to it.
Notes:
• Extreme condition: "expand" the leaf set to 2l.
• Maintaining k replicas becomes impossible if total storage keeps decreasing.
Optimizations
Storage: file encoding, e.g. Reed-Solomon codes: rather than m full replicas per file, keep m checksum blocks for every n data blocks.
Performance: caching.
• Goals: reduce client-access latencies, maximize query throughput, balance the query load.
• Algorithm: GreedyDual-Size (GD-S):
  – Upon a hit for file d: H_d = c(d) / s(d).
  – Eviction: evict the file v with minimal H_v, then subtract H_v from the remaining H values.
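A sketch of GreedyDual-Size as stated on the slide (the subtraction variant; production implementations usually keep a global inflation value L instead of touching every entry). c(d) and s(d) are the file's cost and size.

```python
def gds_admit(cache, capacity, d, cost, size):
    """GreedyDual-Size sketch. `cache` maps file id -> [H, size].
    On a hit or insertion: H(d) = c(d) / s(d). While over capacity,
    evict the file with minimal H and subtract that H from the rest."""
    cache[d] = [cost / size, size]
    used = sum(s for _, s in cache.values())
    while used > capacity and cache:
        victim = min(cache, key=lambda f: cache[f][0])
        h_min, s_victim = cache.pop(victim)
        used -= s_victim
        for entry in cache.values():        # age the survivors
            entry[0] -= h_min
    return cache
```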
Experiments – Setup
Workload 1: 8 web proxy logs from NLANR: 4 million entries referencing 1,863,055 unique URLs; 18.7 GB of content. Mean = 10,517 B, median = 1,312 B, max = 138 MB, min = 0 B.
Workload 2: file name and size information combined from several file systems: 2,027,908 files; 166.6 GB. Mean = 88,233 B, median = 4,578 B, max = 2.7 GB, min = 0 B.
System: k = 5, b = 4, N = 2250 nodes; node storage contributions drawn from 4 normal distributions.
Experiment 0
Disable replica and file diversions: t_pri = 1, t_div = 0; reject upon first failure.
Results: 51.1% of file insertions failed; storage utilization = 60.8%.
Storage Contribution & Leaf Set Size
Experiment: Workload 1, t_pri = 0.1, t_div = 0.05; leaf-set size and storage-contribution distribution vary.
Results: larger leaf sets reduce failures and improve utilization; distribution d2 performs best.
Sensitivity of Replica Diversion Parameter t_pri
Experiment: Workload 1, l = 32, t_div = 0.05; t_pri varies.
Results: t_pri trades the fraction of successful insertions against achieved storage utilization (see figure).
Sensitivity of File Diversion Parameter t_div
Experiment: Workload 1, l = 32, t_pri = 0.1; t_div varies.
Results: t_div likewise trades successful insertions against storage utilization.
t_pri = 0.1 and t_div = 0.05 yield the best results.
Diversions
File diversions are negligible as long as storage utilization stays below 83%; the diversion overhead is acceptable.
Insertion Failures w/ Respect to File Size
Workload 1: t_pri = 0.1, t_div = 0.05.
Workload 2: t_pri = 0.1, t_div = 0.05.
Experiments – Caching
As utilization grows, replica diversions increase; yet even at 99% utilization, caching remains effective, thanks to the prevalence of small files.
Conclusion
PAST achieves its goals. But:
• It is application-specific.
• It is hard to deploy: what is the incentive for nodes to contribute storage?
Additional comments?