Large Scale Sharing
The Google File System
PAST: Storage Management & Caching
– Presented by Chi H. Ho
Introduction
A next step from network file systems. How large?
GFS:
• > 1,000 storage nodes
• > 300 TB of disk storage
• Hundreds of client machines
PAST: Internet-scale
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Goals
Performance, Scalability, Reliability, Availability.
Highly tuned for:
• Google's back-end file service
• Workloads: multiple-producer/single-consumer, many-way merging
Assumptions
• H/W: inexpensive components that often fail.
• Files: a modest number of large files.
• Reads/Writes: 2 kinds:
  – Large streaming: the common case => optimized.
  – Small random: supported, but need not be efficient.
• Concurrency: hundreds of concurrent appends.
• Performance: high sustained bandwidth is more important than low latency.
Interface
Usual operations: create, delete, open, close, read, and write.
GFS-specific:
• snapshot: creates a copy of a file or a directory tree at low cost.
• record append: lets many clients append to the same file concurrently and atomically.
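As a rough sketch only, the client-facing surface just listed might look like the Python stub below. The class and method names are hypothetical illustrations; GFS's actual client library is not public.

```python
# Hypothetical sketch of the client operations listed above;
# names and signatures are illustrative, not GFS's actual API.

class GFSClient:
    # Usual operations
    def create(self, path): ...
    def delete(self, path): ...
    def open(self, path): ...
    def close(self, handle): ...
    def read(self, handle, offset, length): ...
    def write(self, handle, offset, data): ...

    # GFS-specific operations
    def snapshot(self, src, dst):
        """Copy a file or directory tree at low cost (copy-on-write)."""

    def record_append(self, handle, record):
        """Append atomically even under concurrency; returns the
        offset that GFS chose for the record."""
```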
Architecture
[Figure: GFS architecture. The master, the chunkservers, and the clients all run as user-level processes.]
Architecture (Files)
Files are divided into fixed-size chunks, each replicated at multiple chunkservers (default 3) and stored as a Linux file.
Each chunk is identified by an immutable and globally unique chunk handle assigned by the master at the time of chunk creation.
Read/Write data chunk specified by <chunk handle, byte range>
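A minimal sketch of the read path this implies, assuming a hypothetical master RPC find_chunk(path, chunk_index) -> (handle, replica_addrs):

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB chunks (see "Chunk Size" below)

def read_range(master, connect, path, offset, length):
    """Client-side read sketch: translate a byte offset into a chunk
    index, get the handle and replica locations from the master, then
    fetch the byte range directly from one chunkserver replica."""
    chunk_index = offset // CHUNK_SIZE
    handle, replicas = master.find_chunk(path, chunk_index)  # hypothetical RPC
    start = offset % CHUNK_SIZE
    n = min(length, CHUNK_SIZE - start)   # stay within this chunk
    chunkserver = connect(replicas[0])    # GFS would prefer a nearby replica
    return chunkserver.read(handle, start, n)
```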
Architecture (Master)
Maintains metadata:
• Namespace
• Access control
• Mapping from files to chunks
• Chunks' locations
Controls system-wide activities:
• Chunk lease management
• Garbage collection
• Chunk migration
• Heartbeat messages to chunkservers
Architecture (Client)
• Interacts with the master for metadata.
• Communicates directly with chunkservers for data.
Architecture (Notes)
No data cache is needed: Why?
• Client: ???
• Chunkservers: ???
Architecture (Notes)
No data cache is needed: Why?
• Client: most applications stream through huge files or have working sets too large to be cached.
• Chunkservers: already have Linux cache.
Single Master
Bottleneck?
Single point of failure?
Single Master
Bottleneck? No: data never flows through the master; clients only ask it for chunk locations, prefetch locations for multiple chunks per request, and cache them.
Single point of failure? No: the master's state is replicated on multiple machines, mutations of that state are atomic, and "shadow" masters temporarily serve reads.
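A sketch of why the master stays off the critical path: clients cache chunk locations and fetch several consecutive entries per master round trip. The batched RPC and the prefetch count here are assumptions for illustration, not the paper's exact protocol.

```python
class ChunkLocationCache:
    """Client-side cache of (path, chunk_index) -> (handle, replicas).
    One master request prefetches locations for several chunks, so
    most reads never contact the master at all."""

    PREFETCH = 4  # hypothetical: entries fetched per master round trip

    def __init__(self, master):
        self.master = master
        self.entries = {}

    def locate(self, path, chunk_index):
        key = (path, chunk_index)
        if key not in self.entries:
            # Hypothetical batched RPC yielding (index, handle, replicas).
            for idx, handle, replicas in self.master.find_chunks(
                    path, chunk_index, count=self.PREFETCH):
                self.entries[(path, idx)] = (handle, replicas)
        return self.entries[key]
```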
Chunk Size
Large: 64 MB. Advantages:
• Reduces client-master interaction.
• Reduces network overhead (persistent TCP connections).
• Reduces the size of metadata => it can be kept in memory (see the arithmetic below).
Disadvantage: small files (with few chunks) may become hot spots.
Solutions: store more replicas of small files; potentially, let clients read from other clients.
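To gauge the metadata point: at 64 MB per chunk, the > 300 TB deployment from the introduction holds on the order of 300 TB / 64 MB ≈ 5 million chunks; the paper reports under 64 bytes of metadata per chunk, so the whole mapping needs only a few hundred MB of master memory.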
Metadata
Three major types, all kept in the master's memory:
• file and chunk namespaces,
• file-to-chunk mapping,
• locations of each chunk's replicas.
Persistence:
• Namespaces and mapping: an operation log stored on multiple machines.
• Chunks' locations: polled when the master starts and when chunkservers join; updated by heartbeat messages.
Operation Log
At the heart of GFS:
• The only persistent record of metadata.
• The logical timeline that orders concurrent operations.
Operations are committed atomically. The master's state is recovered by replaying the operations in the log.
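A minimal sketch of recovery by replay, assuming one JSON record per committed operation; checkpoints, log replication, and the real record format are omitted.

```python
import json

def recover_master_state(log_path):
    """Rebuild the master's in-memory metadata by replaying the
    operation log in commit order (sketch)."""
    state = {"namespace": {}}
    with open(log_path) as log:
        for line in log:
            apply_op(state, json.loads(line))
    return state

def apply_op(state, op):
    """Hypothetical dispatch over logged operation types."""
    if op["type"] == "create_file":
        state["namespace"][op["path"]] = []          # empty chunk list
    elif op["type"] == "add_chunk":
        state["namespace"][op["path"]].append(op["handle"])
    elif op["type"] == "delete_file":
        state["namespace"].pop(op["path"], None)
```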
Consistency
Metadata: solely controlled by the master.
Data: consistent after successful mutations:
• The same order of mutations is applied on all replicas.
• Stale replicas (missing some mutations) are detected and eliminated.
Defined: consistent, and clients see what the mutation writes in its entirety.
Consistent: all clients see the same data, regardless of which replica they read from.
Leases and Mutation Order
Lease: high-level chunk-based access control mechanism, granted by the master.
Global mutation order = lease grant order + serial number within a lease, chosen by the primary (lease holder).
Illustration of a mutation
1. The client asks the master for the lease holder of the chunk and the locations of the primary and secondary replicas.
2. The master locates the lease (or grants one if none exists) and replies; the client caches the locations.
3. The client pushes the data to all replicas; each stores it in an LRU buffer and acks.
4. After waiting for all replicas to ack, the client sends the write request to the primary.
5. The primary assigns a serial number to the request and forwards the write request to the secondaries.
6. The secondaries report the request completed.
7. The primary replies to the client (possibly with errors).
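The seven steps condensed into a sketch; the RPC names are invented for illustration, and error handling is reduced to returning the primary's reply.

```python
def mutate_chunk(master, cache, path, chunk_index, data):
    """Sketch of one mutation following the numbered steps above."""
    # Steps 1-2: find the lease holder (primary) and the secondaries,
    # consulting the client-side cache first; cache the answer.
    key = (path, chunk_index)
    if key not in cache:
        cache[key] = master.find_lease_holder(path, chunk_index)
    handle, primary, secondaries = cache[key]

    # Step 3: push the data to every replica; each buffers it (LRU)
    # and acks. Data flow is decoupled from control flow.
    acks = [r.push_data(handle, data) for r in [primary] + secondaries]

    # Step 4: once all replicas have acked, send the write request to
    # the primary. Steps 5-6: the primary assigns a serial number,
    # forwards the request to the secondaries, and collects their
    # completions. Step 7: the primary's reply (possibly with errors).
    assert all(acks)
    return primary.write(handle, secondaries=secondaries)
```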
Special Ops Revisited
Atomic Record Appends:
• The offset is chosen by GFS (the primary replica), not the client.
• Upon failure: pad the failed replica(s), then retry.
• Guarantee: the record is appended to the file at least once atomically.
Snapshot:
• Copy-on-write; used to make a copy of a file/directory quickly.
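At-least-once semantics mean a retried append can leave duplicates (and padding) in the file. A common remedy, sketched here as an assumption rather than as GFS's prescribed mechanism, is for writers to tag records with unique IDs and for readers to filter duplicates:

```python
import uuid

def append_with_retry(gfs, handle, payload):
    """Writer side: retry record_append until it succeeds. The record
    may land in the file more than once (at-least-once semantics)."""
    record = {"id": str(uuid.uuid4()), "payload": payload}
    while True:
        ok, offset = gfs.record_append(handle, record)  # GFS picks offset
        if ok:
            return offset

def iter_unique(records):
    """Reader side: suppress duplicates via the embedded record IDs."""
    seen = set()
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            yield rec
```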
Master Operations
• Namespace management and locking: to support concurrent master operations.
• Replica placement: to avoid dependent failures and exploit network bandwidth.
• Creation, re-replication, rebalancing: for better disk utilization, load balancing, and fault tolerance.
• Garbage collection: lazy deletion is simple, efficient, and supports undelete.
• Stale replica detection: obsolete replicas are detected => garbage collected.
Fault Tolerance Sum Up
Master fails? Chunkservers fail? Disks corrupted? Network noise?
Micro-benchmarks
Configuration: 1 master (with 2 replicas), 16 chunkservers, 16 clients.
Each machine: dual 1.4 GHz PIII, 2 GB memory, 2x 80 GB 5400 rpm disks, full-duplex 100 Mbps NIC.
[Figure: the clients sit on one switch, the master and chunkservers on another; a 1 Gbps link connects the two switches.]
Micro-benchmark Tests and Results
Reads: N clients read simultaneously, at random, from a 320 GB file set; each client reads 1 GB in 4 MB reads.
Writes: N clients write simultaneously to N distinct files; each client writes 1 GB in 1 MB writes.
Record appends: N clients append simultaneously to one file.
Real World Clusters
Cluster A: R&D use by over 100 engineers. Typical task:
• initiated by a human user and runs up to several hours;
• reads MBs to TBs of data, processes it, and writes the results back.
Cluster B: production data processing. Tasks:
• long-lasting; continuously generate and process multi-TB data sets;
• only occasional human intervention.
Real World Measurements
The table shows:
• Sustained high throughput.
• Light workload on the master.
Recovery: a full recovery of one failed chunkserver takes 23.2 minutes; prioritized recovery to a state that tolerates one more failure takes 2 minutes.
Workload Breakdown
Conclusion
The design is narrow: highly specific to Google's applications. Most of the challenges are in implementation: more a development effort than a research one. However, GFS is a complete, deployed solution.
Any opinions/comments?
Storage management and caching in PAST, a large-scale, persistent
peer-to-peer storage utility
Antony Rowstron, Peter Druschel
What is PAST?
• An Internet-based, P2P global storage utility.
• An archival storage and content distribution utility, not a general-purpose file system.
• Nodes form a self-organizing overlay network; nodes may contribute storage.
• Files are inserted and retrieved, identified by a fileID (and optionally protected by a key). Files are immutable.
• PAST does not itself provide a lookup service; it is built on top of one, such as Pastry.
Goals
Strong persistence, High availability, Scalability, Security.
Background – Pastry
A P2P routing substrate. Given (fileID, msg), route msg to the node whose nodeID is numerically closest to fileID.
Routing cost: ceiling(log_{2^b} N) steps.
Eventual delivery is guaranteed unless floor(l/2) nodes with adjacent nodeIDs fail simultaneously.
Per-node state: (2^b - 1) * ceiling(log_{2^b} N) + 2l entries mapping nodeIDs to IP addresses.
Node recovery takes O(log_{2^b} N) messages.
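For example, with b = 4 and N = 2250 (the configuration used in the experiments later in this talk), routing takes at most ceiling(log_16 2250) = 3 steps.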
Pastry – A closer look…
Routing: forward a message with fileID to a node whose nodeID shares more digits with fileID than the current node's does; if no such node is known, forward to a node whose match is equally long but whose nodeID is numerically closer to fileID.
Other nice properties: fault-resilient, self-organizing, scalable, efficient.
[Figure: example routing table and route, b = 2, l = 8.]
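A sketch of that forwarding rule over hex-digit IDs; the per-node state is simplified to a flat set of known nodes rather than Pastry's routing table plus leaf set.

```python
def shared_prefix(a, b):
    """Length of the common leading-digit prefix of two IDs."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(my_id, file_id, known):
    """One Pastry-style routing step (sketch). Prefer a node sharing a
    longer prefix with file_id than we do; otherwise fall back to a
    known node that is numerically closer to file_id."""
    mine = shared_prefix(my_id, file_id)
    better = [n for n in known if shared_prefix(n, file_id) > mine]
    if better:
        return max(better, key=lambda n: shared_prefix(n, file_id))
    dist = lambda n: abs(int(n, 16) - int(file_id, 16))
    closer = [n for n in known if dist(n) < dist(my_id)]
    return min(closer, key=dist) if closer else None  # None: we are closest

# e.g. next_hop("65a1fc", "d46a1c", {"d13da3", "d4213f"}) -> "d4213f"
```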
PAST Operations
Insert:
• fileID := SHA-1(filename, public key, salt) => unique.
• A file certificate is issued; the client's quota is charged.
Lookup:
• Based on fileID; a node returns the file's contents and certificate.
Reclaim:
• The client issues a reclaim certificate for authentication.
• The client's quota is credited, double-checked against a reclaim receipt.
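A sketch of the fileID computation; the paper specifies SHA-1 over the file name, the owner's public key, and a random salt, though the exact serialization below is an assumption.

```python
import hashlib
import os

def make_file_id(filename: str, public_key: bytes):
    """Derive (fileID, salt) as in PAST's insert operation. Retrying
    with a fresh salt yields a different fileID, which is exactly the
    mechanism file diversion relies on (see Storage Management)."""
    salt = os.urandom(20)
    h = hashlib.sha1()
    h.update(filename.encode("utf-8"))
    h.update(public_key)
    h.update(salt)
    return h.digest(), salt  # fileID is 160 bits, like nodeIDs
```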
Security Overview
Each node and each user holds a smartcard. Security model:
• It is infeasible to break the cryptosystems.
• Most nodes are well-behaved.
• Smartcards cannot be controlled by an attacker.
The smartcards generate the certificates and receipts that enforce security: file certificates, reclaim certificates, reclaim receipts, etc.
Storage Management
Assumptions:
• Storage capacities of nodes differ by no more than 2 orders of magnitude.
• Advertised capacity is the basis for admitting nodes.
Two conflicting responsibilities:
• Balance free storage under stress.
• Keep k copies of each file fileID at the k nodes whose nodeIDs are closest to fileID.
I) Load Balancing
What causes load imbalance? Differences in:
• #files per node (due to the distribution of nodeIDs and fileIDs),
• the size distribution of inserted files,
• the storage capacity of nodes.
What does the solution aim for? Blur the differences by redistributing data:
• Replica diversion: local scale (relocate a replica among leaf-set nodes).
• File diversion: global scale (relocate all replicas under a different fileID).
[Flowchart: storage decision at node N receiving file D. Notation: S_D = size of D; F_N, F_N' = free space at N, N'; t_pri = primary threshold; t_div = diversion threshold.]
• If S_D / F_N <= t_pri: N stores D and issues a receipt (D is also forwarded to the other k-1 replica holders).
• Otherwise, attempt replica diversion: choose the diversion node N' = the node in N's leaf set with maximum free storage, among those not holding the fileID in their k-closest set and not already holding a diverted replica.
• If no such N' exists, or S_D / F_N' > t_div: fall back to file diversion.
• Otherwise: D is stored at N'; N points to N', and the (k+1)-st closest node also keeps a pointer.
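The same decision logic as code; leaf_set, the free-space fields, and pointer installation are simplified placeholders (a sketch, with the threshold values from the experiments below).

```python
def on_insert(node, file_size, t_pri=0.1, t_div=0.05):
    """Storage decision at a node N among the k closest to the fileID
    (sketch of the flowchart above)."""
    if file_size / node.free_space <= t_pri:
        node.store(file_size)               # store D, issue receipt
        return "stored"

    # Replica diversion: the leaf with the most free space that is not
    # itself among the k closest and holds no diverted replica yet.
    candidates = [x for x in node.leaf_set
                  if not x.in_k_closest and not x.has_diverted_replica]
    best = max(candidates, key=lambda x: x.free_space, default=None)
    if best is not None and file_size / best.free_space <= t_div:
        best.store(file_size)               # store D at N'; N (and the
        node.diversion_ptr = best           # (k+1)-st node) point to N'
        return "diverted"

    # Otherwise: file diversion -- the client retries the whole insert
    # with a new salt, producing a different fileID.
    return "file_diversion"
```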
II) Maintaining k Replicas
Problem: nodes join and leave.
On joining:
• The new node initially just holds a pointer to the node it replaced (similar to replica diversion).
• Replicas are gradually migrated to it as a background job.
On leaving:
• Each affected node picks a new k-th closest node, updates its leaf set, and forwards replicas to it.
Notes:
• Extreme condition: "expand" the leaf set to 2l.
• Maintaining k replicas becomes impossible if total storage keeps decreasing.
Optimizations
Storage: file encoding, e.g. Reed-Solomon codes: rather than m full replicas per file, keep m checksum blocks for every n data blocks.
Performance: caching.
• Goals: reduce client-access latencies, maximize query throughput, balance the query load.
• Algorithm: GreedyDual-Size (GD-S):
  – Upon a hit for file d: H_d = c(d) / s(d).
  – Eviction: evict the file v with minimal H_v, then subtract H_v from the remaining H values.
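A sketch of GreedyDual-Size as stated on the slide (the subtraction variant; production implementations usually keep a global inflation value L instead of touching every entry). c(d) and s(d) are the file's cost and size.

```python
def gds_admit(cache, capacity, d, cost, size):
    """GreedyDual-Size sketch. `cache` maps file id -> [H, size].
    On a hit or insertion: H(d) = c(d) / s(d). While over capacity,
    evict the file with minimal H and subtract that H from the rest."""
    cache[d] = [cost / size, size]
    used = sum(s for _, s in cache.values())
    while used > capacity and cache:
        victim = min(cache, key=lambda f: cache[f][0])
        h_min, s_victim = cache.pop(victim)
        used -= s_victim
        for entry in cache.values():        # age the survivors
            entry[0] -= h_min
    return cache
```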
Experiments – Setup
Workload 1: 8 web proxy logs from NLANR: 4 million entries referencing 1,863,055 unique URLs; 18.7 GB of content. Mean = 10,517 B, median = 1,312 B, max = 138 MB, min = 0 B.
Workload 2: file name and size information combined from several file systems: 2,027,908 files; 166.6 GB. Mean = 88,233 B, median = 4,578 B, max = 2.7 GB, min = 0 B.
System: k = 5, b = 4, N = 2250 nodes; node storage contributions drawn from 4 normal distributions.
Experiment 0
Disable replica and file diversions: t_pri = 1, t_div = 0; reject upon first failure.
Results: 51.1% of file insertions failed; storage utilization = 60.8%.
Storage Contribution & Leaf Set Size
Experiment: Workload 1, t_pri = 0.1, t_div = 0.05; leaf-set size and storage-contribution distribution vary.
Results: larger leaf sets reduce failures and improve utilization; distribution d2 performs best.
Sensitivity of Replica Diversion Parameter t_pri
Experiment: Workload 1, l = 32, t_div = 0.05; t_pri varies.
Results: t_pri trades the fraction of successful insertions against achieved storage utilization (see figure).
Sensitivity of File Diversion Parameter t_div
Experiment: Workload 1, l = 32, t_pri = 0.1; t_div varies.
Results: t_div likewise trades successful insertions against storage utilization.
t_pri = 0.1 and t_div = 0.05 yield the best results.
Diversions
File diversions are negligible as long as storage utilization stays below 83%; the diversion overhead is acceptable.
Insertion Failures w/ Respect to File Size
Workload 1: t_pri = 0.1, t_div = 0.05.
Workload 2: t_pri = 0.1, t_div = 0.05.
Experiments – Caching
As utilization grows, replica diversions increase; yet even at 99% utilization, caching remains effective, thanks to the prevalence of small files.
Conclusion
PAST achieves its goals. But:
• It is application-specific.
• It is hard to deploy: what is the incentive for nodes to contribute storage?
Additional comments?