The Theory of the Hardware Failure Effect: GoogleFS (a whitepaper review)

Dan Şerban


Page 1

The Theory of the Hardware Failure Effect: GoogleFS

(a whitepaper review)

Dan Şerban

Page 2

:: The Google File System was created by man at the dawn of the 21st century

:: In July 2010, it had not yet become self-aware

… but seriously now …

:: GoogleFS is not really a file system

:: GoogleFS is a storage framework for doing a very specific type of parallel computing (MapReduce) on massive amounts of data

http://labs.google.com/papers/gfs.html

And so it begins...

Page 3

1) Mappers read (key1, valueA) pairs from GoogleFS input files named key1.txt (large streaming reads)

2) Mappers derive key2 and valueB from valueA

3) Mappers record-append (key2_key1, valueB) pairs to temporary GoogleFS files named key2_key1.txt

4) Reducers read (key2_key1, valueB) pairs from those temporary key2_key1.txt files (large streaming reads)

5) Reducers aggregate by key2: valueC = aggregateFunction(valueB)

6) Reducers record-append (key2, valueC) pairs to permanent GoogleFS files named key2.bigtbl

Overview of MapReduce
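
To make the six steps above concrete, here is a minimal single-process Python sketch of the same flow. The derivation of key2 and valueB, the aggregateFunction choice (sum), and the use of in-memory dicts in place of GoogleFS files and record-appends are illustrative assumptions, not part of the whitepaper.

# Minimal single-process sketch of the MapReduce flow described above.
# Plain Python dicts stand in for GoogleFS files and record-appends.
from collections import defaultdict

def map_phase(input_files):
    """Steps 1-3: read (key1, valueA), derive (key2, valueB), append to key2_key1 'files'."""
    temp_files = defaultdict(list)                        # "key2_key1.txt" -> list of valueB
    for key1, values in input_files.items():
        for valueA in values:
            key2, valueB = valueA.split()[0], len(valueA) # assumed derivation, for illustration only
            temp_files[f"{key2}_{key1}.txt"].append(valueB)
    return temp_files

def reduce_phase(temp_files, aggregate_function=sum):
    """Steps 4-6: read the temporary files, aggregate by key2, append to key2.bigtbl."""
    grouped = defaultdict(list)
    for name, values in temp_files.items():
        key2 = name.split("_")[0]
        grouped[key2].extend(values)
    return {f"{key2}.bigtbl": [aggregate_function(values)] for key2, values in grouped.items()}

if __name__ == "__main__":
    inputs = {"doc1": ["foo bar baz", "foo qux"], "doc2": ["bar baz"]}
    print(reduce_phase(map_phase(inputs)))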

Page 4

The problem: Web Indexing

The solution:

:: Find the sweet spot in price / performance for hardware components

:: Build a large-scale network of machines from those components

:: Make the assumption of frequent hardware failures the single most important design principle for building the file system

:: Apply highly sophisticated engineering to make the system reliable in software

The philosophy behind GoogleFS

Page 5

:: The quality and quantity of components together make frequent hardware failures the norm rather than the exception
- Do not completely trust the machines
- Do not completely trust the disks

:: GoogleFS must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis

:: Streaming data access to large data sets is the focus of GoogleFS optimization (disks are increasingly becoming seek-limited)

:: High sustained throughput is more important than low latency

Assumptions and Design Aspects

Page 6

:: Relaxed consistency model (slightly adjusted and extended file system API)

:: GoogleFS must efficiently implement well-defined semantics for multiple clients concurrently appending to the same file with minimal synchronization overhead

:: Prefer moving computation to where the data is vs. moving data over the network to be processed elsewhere

:: Application designers don't have to deal with component failures or where data is located

Assumptions and Design Aspects

Page 7

:: Google controls both layers of the architectural stack: the file system and the MapReduce applications

:: Applications only use a few specific access patterns:
- Large, sustained streaming reads (optimized for throughput)
- Occasional small, but very seek-intensive random reads (supported, not optimized for within GoogleFS, but rather via app-level read barriers)
- Large, sustained sequential appends by a single client (optimized for throughput)
- Simultaneous atomic record appends by a large number of clients (optimized for heavy-duty concurrency by relaxing the consistency model)

What GoogleFS exploits

Page 8

One master, multiple chunkservers, multiple clients

Who is who?

Page 9

:: A GoogleFS cluster consists of a single master and multiple chunkservers
- The cluster is accessed by multiple clients

:: Clients, chunkservers and the master are implemented as userland processes running on commodity Linux machines

:: Files are divided into fixed-size chunks
- 64 MB in GoogleFS v1
- 1 MB in GoogleFS v2 / Caffeine

Who is who?
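
One practical consequence of the fixed chunk size: a client can turn any byte offset in a file into a chunk index with integer division, then ask the master only about that chunk. A tiny sketch, assuming the 64 MB size from the slide above (the function name is just for illustration):

# Translating a file offset into a chunk index, assuming the 64 MB
# chunk size of GoogleFS v1 mentioned above.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def chunk_index_for_offset(offset: int) -> int:
    return offset // CHUNK_SIZE

# Example: byte 200,000,000 of a file falls into chunk index 2.
assert chunk_index_for_offset(200_000_000) == 2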

Page 10

:: The master stores three major types of data:
- The file and chunk namespaces
- The file-to-chunk mappings
- The locations of each chunk's replicas

:: All metadata is kept in the master's memory

:: The master is not involved in any of the data transfers
- It's not on the critical path for moving data

:: It's also responsible for making decisions about chunk allocation, placement and replication

:: The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state

:: The master delegates low-level work via chunk leases

:: It was the single point of failure in GoogleFS v1

:: GoogleFS v2 / Caffeine now supports multiple masters

Who is who - The Master
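
A hedged sketch of how the three kinds of master metadata listed above might be held in memory. The field names and shapes are assumptions for illustration; a real master also logs namespace and mapping mutations durably, which this sketch omits.

# In-memory metadata a GoogleFS-like master keeps, per the slide above.
# Names and data shapes are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class MasterMetadata:
    # namespace plus file-to-chunk mapping: path -> ordered list of chunk handles
    namespace: dict = field(default_factory=dict)
    # chunk handle -> list of chunkserver addresses holding a replica
    chunk_locations: dict = field(default_factory=dict)

    def chunk_handle_for(self, path: str, chunk_index: int) -> int:
        """File-to-chunk mapping: which handle backs this chunk of this file."""
        return self.namespace[path][chunk_index]

    def replicas_for(self, handle: int) -> list:
        """Replica locations, refreshed from chunkserver HeartBeat messages."""
        return self.chunk_locations.get(handle, [])

meta = MasterMetadata()
meta.namespace["/logs/key1.txt"] = [101, 102]
meta.chunk_locations[101] = ["cs-a:7000", "cs-b:7000", "cs-c:7000"]
print(meta.replicas_for(meta.chunk_handle_for("/logs/key1.txt", 0)))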

Page 11

:: Chunkservers are servers that store chunks of data on their local Linux file systems

:: They have no knowledge of the GoogleFS metadata
- They only know about their own chunks

:: Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation

:: Chunks become read-only once they have been completely written (file writes are append-only)

:: For reliability, each chunk is replicated on multiple chunkservers (the default is 3 replicas)

:: Records appended to a chunk are checksummed on the way in and verified before they are delivered to clients

Who is who - Chunkservers
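
The last bullet, records being checksummed on the way in and verified on the way out, could look roughly like the sketch below. Using CRC-32 via Python's zlib is an assumption; the whitepaper does not prescribe the checksum algorithm or this layout.

# Sketch of per-record checksumming on append and verification on read,
# per the last bullet above. CRC-32 is an assumed choice.
import zlib

class ChunkReplica:
    def __init__(self):
        self.records = []          # list of (crc, payload) pairs

    def append_record(self, payload: bytes) -> None:
        self.records.append((zlib.crc32(payload), payload))

    def read_record(self, index: int) -> bytes:
        crc, payload = self.records[index]
        if zlib.crc32(payload) != crc:
            # A real chunkserver would report the mismatch to the master,
            # which would re-replicate the chunk from a healthy replica.
            raise IOError("checksum mismatch, replica is corrupt")
        return payload

replica = ChunkReplica()
replica.append_record(b"key2_key1\tvalueB")
print(replica.read_record(0))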

Page 12

:: Clients are Linux machines running distributed applications that drive various kinds of MapReduce workloads

:: GoogleFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application

:: Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers

:: Clients cache metadata (for a limited period of time) but they never cache chunk data

Who is who - Clients
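
The "cache metadata for a limited period of time, never cache chunk data" point can be sketched as a small TTL cache inside the client library. The 60-second TTL and the cache layout are assumptions for illustration only.

# Client-side metadata cache with a time-to-live, per the last bullet above.
# The TTL value and entry layout are illustrative assumptions.
import time

class MetadataCache:
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self.entries = {}   # (path, chunk_index) -> (expiry, chunk handle, replica locations)

    def put(self, path, chunk_index, handle, locations):
        self.entries[(path, chunk_index)] = (time.time() + self.ttl, handle, locations)

    def get(self, path, chunk_index):
        entry = self.entries.get((path, chunk_index))
        if entry is None or entry[0] < time.time():
            return None          # miss or expired: ask the master again
        return entry[1], entry[2]

cache = MetadataCache()
cache.put("/logs/key1.txt", 0, 101, ["cs-a:7000", "cs-b:7000", "cs-c:7000"])
print(cache.get("/logs/key1.txt", 0))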

Page 13

Appending

Page 14

Regular append (optimized for throughput from a single client)

:: Client specifies:
- file name
- offset
- data

:: Client sends data to chunkservers in pipeline fashion

:: Client sends "append-commit" to the lease holder

:: The LH gets ACKs back from all chunkservers involved

:: The LH sends an ACK back to the client

Appending
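
A rough control-flow sketch of the regular append path above, with the data push and the commit as separate steps. The function names and the in-memory replica dicts are assumptions; a real client does this over RPC against chunkserver processes.

# Sketch of the regular-append flow on the slide above: pipeline the data
# to all replicas, then commit through the lease holder (LH). Illustrative only.

def pipeline_data(replicas, data):
    """Client pushes data along the chunkserver pipeline; each node buffers it."""
    for cs in replicas:
        cs.setdefault("buffer", []).append(data)

def append_commit(lease_holder, replicas, offset, data):
    """Client asks the LH to commit; the LH applies the write and waits for all ACKs."""
    acks = []
    for cs in replicas:
        # Simplified: a real chunkserver writes the buffered data at the given offset.
        cs.setdefault("chunk", bytearray()).extend(data)
        acks.append(True)
    return all(acks)                       # the LH then ACKs back to the client

replicas = [{"name": "cs-a"}, {"name": "cs-b"}, {"name": "cs-c"}]
pipeline_data(replicas, b"record-1")
print(append_commit(replicas[0], replicas, offset=0, data=b"record-1"))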

Page 15

Atomic record append (optimized for heavy-duty concurrency)

:: Client specifies:
- file name
- data

:: Client sends data to chunkservers in pipeline fashion

:: Client sends "append-commit" to the lease holder

:: The LH serializes commits

:: The LH gets ACKs back from all chunkservers involved

:: The LH sends an ACK back to the client

:: The LH converts some of the requests into pad requests

Appending
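
The distinctive part of atomic record append is that the client supplies no offset: the lease holder serializes commits, picks the offset, and converts a request into a pad request when the record would not fit in the current chunk. A minimal sketch of just that decision, assuming the 64 MB chunk size; all names are illustrative.

# Sketch of a lease holder serializing record appends and issuing pad requests,
# per the slide above. Sizes and names are assumptions.
CHUNK_SIZE = 64 * 1024 * 1024

class LeaseHolder:
    def __init__(self):
        self.used = 0            # bytes already committed in the current chunk

    def commit(self, record: bytes):
        """Commits are handled one at a time, so concurrent appends are serialized."""
        if self.used + len(record) > CHUNK_SIZE:
            pad = CHUNK_SIZE - self.used     # convert the tail into a pad request
            self.used = CHUNK_SIZE           # chunk is full; the client retries on a new chunk
            return ("pad", pad)
        offset = self.used
        self.used += len(record)
        return ("appended", offset)          # the chosen offset goes back to the client

lh = LeaseHolder()
print(lh.commit(b"x" * 1000))                # ('appended', 0)
print(lh.commit(b"y" * (CHUNK_SIZE - 500)))  # ('pad', ...): record must go to a new chunk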

Page 16

The master initiates and coordinates intelligently prioritized re-replication of chunks

:: First priority is to bring the cluster to a basic level of redundancy where it can safely tolerate another node failure (2x replication)

:: Second priority is to minimize impact on application progress (3x replication for live chunks)

:: Third priority is to restore full overall redundancy (3x replication for all chunks, live and read-only)

The master also performs re-replication of a chunk if a chunkserver signals a checksum mismatch when delivering data to clients

Chunkserver goes offline
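
The three priority levels above can be read as a sort key over under-replicated chunks: chunks down to a single replica first, then under-replicated live chunks, then everything else. A sketch under those assumptions (the 3-replica target matches the earlier slide; the data layout is illustrative):

# Sketch of intelligently prioritized re-replication, per the slide above.
TARGET_REPLICAS = 3

def rereplication_priority(replica_count: int, is_live: bool) -> int:
    if replica_count <= 1:
        return 1   # restore basic redundancy: tolerate one more failure
    if replica_count < TARGET_REPLICAS and is_live:
        return 2   # minimize impact on application progress
    return 3       # restore full redundancy for everything else

chunks = [
    {"handle": 101, "replicas": 2, "live": False},
    {"handle": 102, "replicas": 1, "live": False},
    {"handle": 103, "replicas": 2, "live": True},
]
queue = sorted(chunks, key=lambda c: rereplication_priority(c["replicas"], c["live"]))
print([c["handle"] for c in queue])   # [102, 103, 101]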

Page 17

The master:

:: asks the chunkserver about each chunk replica under its control

:: performs a version number cross-check with its own metadata

:: instructs the chunkserver to delete stale chunk replicas (those that were live at the time the chunkserver went offline)

After successfully joining the cluster, the chunkserver periodically polls the master about each chunk replica's status

Chunkserver rejoins the cluster
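
A sketch of the version-number cross-check described above: for each replica the rejoining chunkserver reports, the master compares the reported version against its own record and tells the chunkserver to delete replicas that fell behind while it was offline. Data shapes and names are assumptions.

# Sketch of the master's stale-replica check when a chunkserver rejoins,
# per the slide above. Illustrative data shapes only.

def find_stale_replicas(reported: dict, master_versions: dict) -> list:
    """reported: chunk handle -> version held by the rejoining chunkserver.
    master_versions: chunk handle -> current version according to the master.
    Returns the handles the chunkserver should delete."""
    stale = []
    for handle, version in reported.items():
        if version < master_versions.get(handle, version):
            stale.append(handle)   # the chunk was mutated while this server was offline
    return stale

reported = {101: 7, 102: 4, 103: 9}
master_versions = {101: 7, 102: 6, 103: 9}
print(find_stale_replicas(reported, master_versions))   # [102]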

Page 18

:: Biased towards topological spreading

:: Creation and re-replication
- Focused on spreading the write load
- Density of recent chunk creations used as a proxy for the write load

:: Rebalancing
- Gently moving chunks around to balance disk fullness
- Low priority, fixes imbalances caused by new chunkservers joining the cluster for the first time

Replica management
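
A hedged sketch of placement biased toward topological spreading and away from write-hot servers: pick replica locations from distinct racks, preferring chunkservers with emptier disks and fewer recent chunk creations (the proxy for write load mentioned above). The scoring formula and data layout are purely illustrative.

# Sketch of replica placement per the slide above. The scoring is an assumption.

def place_replicas(chunkservers, copies=3):
    ranked = sorted(chunkservers,
                    key=lambda cs: cs["disk_used"] + cs["recent_creations"])
    chosen, racks_used = [], set()
    for cs in ranked:                       # first pass: one replica per rack
        if cs["rack"] not in racks_used:
            chosen.append(cs["name"])
            racks_used.add(cs["rack"])
        if len(chosen) == copies:
            return chosen
    for cs in ranked:                       # fall back if there are too few racks
        if cs["name"] not in chosen:
            chosen.append(cs["name"])
        if len(chosen) == copies:
            break
    return chosen

servers = [
    {"name": "cs-a", "rack": "r1", "disk_used": 0.30, "recent_creations": 5},
    {"name": "cs-b", "rack": "r1", "disk_used": 0.10, "recent_creations": 1},
    {"name": "cs-c", "rack": "r2", "disk_used": 0.50, "recent_creations": 0},
    {"name": "cs-d", "rack": "r3", "disk_used": 0.20, "recent_creations": 9},
]
print(place_replicas(servers))   # ['cs-c', 'cs-b', 'cs-d']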

Page 19

Questions / Feedback