Mariposa & The Google File System
Haowei Lu
Madhusudhanan Palani
From LAN to WAN
Drawbacks of traditional distributed DBMS
– Static data allocation
  • Move objects manually
– Single administrative structure
  • Cost-based optimizer cannot scale well
– Uniformity
  • Different machine architectures
  • Different data types
From LAN to WAN
New requirements
– Scalability to a large number of cooperating sites
– Data mobility
– No global synchronization
– Total local autonomy
– Easily configurable policies
From LAN to WAN
Solution – a distributed microeconomic approach
– Well-studied economic model
– Reduces scheduling complexity (?!)
– An invisible hand drives sites toward a local optimum
Mariposa
Let each site act on its own behalf to maximize its own profit
In turn, this improves the overall performance of the DBMS ecosystem
Architecture - Glossary
Fragment – the unit of storage that is bought and sold by sites
– Range distribution
– Hash-based distribution
– Unstructured! Whenever the site wants!
Stride – a group of operations that can proceed in parallel
Architecture
The bidding process
The bidding process
The Broker: sends out requests for bids for the query plan
The Bidder: responds to a request for bid with its formulated price and other information in the form
– (C, D, E): Cost, Delay, Expiration date
The whole logic is implemented using RUSH
– A low-level, very efficient embedded scripting language and rule system
– Form: on <condition> do <action>
The bidding process: Bidder
The Bidder: setting the price for a bid
– Billing rate set on a per-fragment basis
– Considers site load
  • Actual bid = computed bid × load average (sketched below)
– Bids reference the hot list from the storage manager
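A minimal Python sketch of this pricing rule, assuming a hypothetical per-fragment billing-rate table and a caller-supplied load average; the delay model and all numbers are illustrative stand-ins, not Mariposa's actual RUSH rules.

    import time

    # Hypothetical per-fragment billing rates (revenue units per unit of work).
    BILLING_RATE = {"emp_frag_1": 2.0, "emp_frag_2": 3.5}

    def formulate_bid(fragment, estimated_work, load_average, ttl_seconds=60):
        """Return the (cost, delay, expiration) triple for a bid request."""
        computed_bid = BILLING_RATE[fragment] * estimated_work
        cost = computed_bid * load_average      # busier sites quote more
        delay = estimated_work * load_average   # assumed linear slowdown
        expiration = time.time() + ttl_seconds  # the bid is valid until then
        return (cost, delay, expiration)

    print(formulate_bid("emp_frag_1", estimated_work=10.0, load_average=1.5))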
The bidding process
The bidding process: Broker
The Broker
– Input: fragmented query plan
– In process: decides the sites to run the fragments & sends out bid acceptances
  • Expensive bid protocol
  • Purchase order protocol (mainly used)
– Output: hands the task off to the coordinator
Expensive Bid Protocol
The bidding process: Broker
[Diagram: the broker sends requests for bids to the bidders (individual sites), consulting the ads table (located at the name server) and a bookkeeping table of previous winning sites (kept at the broker's site), while staying under budget.]
Purchase Order Protocol
The bidding process: Broker
[Diagram: the broker sends the purchase order to the most probable bidder, which either accepts and generates a bill, or refuses and passes the order to another site or returns it to the broker.]
The bidding process: Broker
The Broker finds bidders using the ads table
The bidding process: Broker
The Broker finds bidders using the ads table
Example (sale price ad)
– Query template: SELECT * FROM TMP
– Server id: 123
– Start time: 2011/10/01
– Expiration time: 2011/10/04
– Price: 10 units
– Delay: 5 seconds
The bidding process: Broker
Types of ads (REALLY FANCY)
The bidding process: Bid Acceptance
The main idea: make the difference as large as possible
– Difference := B(D) – C (D: delay, C: cost, B(t): the budget function)
Method: greedy algorithm (sketched below)
– Pre-step: start from the least-delay result
– Iteration steps:
  • Calculate the cost gradient CG := (cost reduction) / (delay increase) for each stride
  • Keep substituting using max(CG) until the difference no longer increases
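A runnable sketch of this greedy substitution, under stated assumptions: strides execute one after another (so delays add), each stride has a set of (cost, delay) bids, and the budget function is a toy linear one. An illustration of the idea, not Mariposa's implementation.

    def greedy_accept(stride_bids, budget):
        """Pick one bid per stride, greedily maximizing B(D) - C."""
        # Pre-step: take the least-delay bid for every stride.
        choice = [min(bids, key=lambda b: b[1]) for bids in stride_bids]

        def difference(ch):
            return budget(sum(d for _, d in ch)) - sum(c for c, _ in ch)

        while True:
            best = None  # (cost gradient, stride index, candidate bid)
            for i, bids in enumerate(stride_bids):
                cur_cost, cur_delay = choice[i]
                for cost, delay in bids:
                    if cost < cur_cost and delay > cur_delay:
                        cg = (cur_cost - cost) / (delay - cur_delay)
                        if best is None or cg > best[0]:
                            best = (cg, i, (cost, delay))
            if best is None:
                break
            _, i, candidate = best
            trial = choice[:i] + [candidate] + choice[i + 1:]
            if difference(trial) <= difference(choice):
                break  # no further increase in B(D) - C: stop
            choice = trial
        return choice

    # Two strides, each offering a fast/expensive and a slow/cheap bid.
    print(greedy_accept([[(10, 1), (6, 3)], [(8, 2), (5, 5)]],
                        budget=lambda t: 30 - t))  # -> [(6, 3), (8, 2)]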
The bidding process: Separate Bidder
Network bidder
– One trip to get bandwidth
– Return trip to get the price
– Happens at the second stage
Storage Manager
An asynchronous process which runs in tandem with the bidder
Objective
– Maximize revenue per unit time
Functions
– Calculate fragment values
– Buy fragments
– Sell fragments
– Split/coalesce fragments
Fragment Values
The value of a fragment is defined using its revenue history
Revenue history consists of
– Query, number of records in the result, time since the last query, last revenue, delay, CPU & I/O used
CPU & I/O are normalized & stored in site-independent units
Each site should
– Convert these CPU & I/O units to site-specific units via weighting functions
– Adjust revenue, since the current node may be faster or slower, by using the average bid curve
Buying Fragments
In order to bid for a query/subquery, the site must hold the referenced fragments
The site can buy fragments in advance (prefetch) or when the query comes in (on demand)
The buyer locates the owner of the fragment and requests its revenue history
It then calculates the value of the fragment
It evicts old fragments (alternate fragments) to free up space
– Only to the extent needed to make space available for the new fragments
Buyer offer price = value of fragment – value of alternate fragments + price received for them
Selling Fragments
The seller can evict the fragment being bought or any other fragment (alternate) of equivalent size (why is this a must?)
The seller will sell if
– offer price > value of the fragment – value of the alternate fragments + price received for them
If the offer price is not sufficient
– the seller tries to evict a fragment of higher value
– lowering the price of the fragment is the final option
Both pricing rules are sketched below.
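A minimal sketch of the buyer's and seller's rules from the last two slides; all values and prices are made-up numbers for illustration.

    def buyer_offer(frag_value, alt_values, alt_sale_price):
        """Buyer's offer: value of the fragment, minus the value of the
        alternate fragments it must evict, plus the price it expects to
        receive for them."""
        return frag_value - sum(alt_values) + alt_sale_price

    def seller_accepts(offer, frag_value, alt_values, alt_sale_price):
        """Seller's rule: sell when the offer beats the value of the
        fragment, less the value of evicted alternates, plus their price."""
        return offer > frag_value - sum(alt_values) + alt_sale_price

    offer = buyer_offer(frag_value=100, alt_values=[30, 20], alt_sale_price=40)
    print(offer)                                                   # 90
    print(seller_accepts(offer, frag_value=80, alt_values=[15],
                         alt_sale_price=10))                       # True: 90 > 75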
Split & Coalesce
When to split/coalesce?
– Split if there are too few fragments; otherwise parallelization will take a hit
– Coalesce if there are too many fragments, as the overhead of dealing with the fragments & the response time will take a hit
The algorithm for split/coalesce must strike the correct balance between the two
How to solve this issue???
An interlude
Why not extend my microeconomics analogy!?!
Stonebraker's Microeconomics Idea
Market pressure should correct inappropriate fragment sizes
Large fragment size => now everyone wants a share of the pie
But the owner does not want to lose the revenue!
The Idea Continued
Break the large fragment into smaller fragments
Smaller fragments mean less revenue & are less attractive to copy
It still continues…
Smaller fragments also mean more overhead => works against the owner!
And it ends…
So, depending on market demand, these two opposing motivations will balance each other
How to solve this issue???
A more “concrete” approach!!
A more “concrete” approach...
Mariposa calculates the expected delay (ED) due to parallel execution on multiple fragments (Numc)
It then computes the expected bid per site as
– B(ED) / Numc
Vary Numc to arrive at the maximum revenue per site => Num* (sketched below)
Sites keep track of this Num* to base their split/coalesce decisions on
**The sites should also ensure that existing contracts are not affected
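A toy search for Num*, under an assumed delay model (perfect speedup plus a fixed per-fragment coordination overhead) and a linear budget function; both are illustrative stand-ins.

    def optimal_fragments(budget, base_delay, max_frags=32):
        """Vary Numc, estimate the expected delay ED, and pick the Numc
        that maximizes the expected revenue per site, B(ED) / Numc."""
        best_num, best_rev = 1, float("-inf")
        for num_c in range(1, max_frags + 1):
            ed = base_delay / num_c + 0.1 * num_c  # speedup + overhead
            revenue_per_site = budget(ed) / num_c
            if revenue_per_site > best_rev:
                best_num, best_rev = num_c, revenue_per_site
        return best_num

    # Num* = 2 for this budget curve and base delay.
    print(optimal_fragments(budget=lambda t: max(0.0, 100 - 10 * t),
                            base_delay=8.0))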
Name Service Architecture
[Diagram: the broker contacts the name service, which queries multiple name servers; each name server covers a set of local sites.]
What are the different types of names?
Internal names: location dependent; they carry info about the physical location of the object
Full names: uniquely identify an object, are location independent & carry full info about the object's attributes
Common names: user defined & defined within a name space
– Simple rules help translate common names to full names
– The missing components are usually derived from parameters supplied by the user or from the user's environment
Name context: similar to access modifiers in programming languages
How are names resolved?
Name resolution discovers the object that is bound to a name
– Common name => full name
– Full name => internal name
The broker employs the following steps to resolve a name (sketched below)
– Searches the local cache
– Rule-driven search to resolve ambiguities
– Queries one or more name servers
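A sketch of that three-step fallback; the cache contents, the completion rule, and the NameServer class are hypothetical scaffolding, not Mariposa's API.

    class NameServer:
        def __init__(self, catalog):
            self.catalog = catalog  # full name -> internal name

        def lookup(self, full_name):
            return self.catalog.get(full_name)

    local_cache = {"EMP": "site12/frag3"}  # common name -> internal name
    rules = [lambda name, user: f"mariposa.{user}.{name}"]  # complete the name

    def resolve(common_name, user, name_servers):
        if common_name in local_cache:            # 1. search the local cache
            return local_cache[common_name]
        for rule in rules:                        # 2. rule-driven search
            full_name = rule(common_name, user)
            for server in name_servers:           # 3. query the name servers
                internal = server.lookup(full_name)
                if internal is not None:
                    local_cache[common_name] = internal
                    return internal
        return None

    servers = [NameServer({"mariposa.alice.DEPT": "site42/frag7"})]
    print(resolve("DEPT", "alice", servers))  # site42/frag7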
How is the QOS of name servers defined?
Name servers help translate common names to full names using the name contexts provided by clients
The name service contacts various name servers
Each name server maintains a composite set of metadata for the local sites under it
It is the role of the name server to periodically update its catalog
QOS is defined as the combination of the price & the staleness of this data
Experiment
The Query:
SELECT *
FROM R1(SB), R2(B), R3(SD)
WHERE R1.u1 = R2.u1
AND R2.u1 = R3.u1
The following statistics are available to the optimizer
– R1 join R2 (1 MB)
– R2 join R3 (3 MB)
– R1 join R2 join R3 (4.5 MB)
A traditional distributed RDBMS plans a query & sends the subqueries to the processing sites, which is the same flow as the purchase order protocol
Therefore the overhead due to Mariposa is the difference in elapsed time between the two protocols, weighted by the proportion of queries using each protocol
Bid price = (1.5 × estimated cost) × load average
– Load average = 1
A node will sell a fragment if
– Offer price > 2 × scan cost / load average
The decision to buy a fragment rather than subcontract is based on
– Sale price <= total money spent on scans
(These three rules are sketched below.)
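The experiment's three rules, written out with illustrative numbers plugged in:

    load_average = 1.0  # as in the experiment

    def bid_price(estimated_cost):
        # Bid price = (1.5 x estimated cost) x load average
        return 1.5 * estimated_cost * load_average

    def will_sell(offer_price, scan_cost):
        # Sell a fragment if offer price > 2 x scan cost / load average
        return offer_price > 2 * scan_cost / load_average

    def buy_rather_than_subcontract(sale_price, money_spent_on_scans):
        # Buy the fragment when its sale price <= total money spent on scans
        return sale_price <= money_spent_on_scans

    print(bid_price(10.0))                          # 15.0
    print(will_sell(25.0, scan_cost=10.0))          # True: 25 > 20
    print(buy_rather_than_subcontract(12.0, 15.0))  # True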
The query optimizer chooses a plan based on the amount of data transferred across the network
The initial plans generated by Mariposa and a traditional system will be similar
But because fragments migrate, subsequent executions of the same query will generate much better plans
GFS - Topics Covered
– Motivation
– Architectural/File System Hierarchical Overview
– Read/Write/Append/Snapshot Operations
– Key Design Parameters
– Replication & Rebalancing
– Garbage Collection
– Fault Tolerance
Motivation
– Customized needs
– Reliability
– Availability
– Performance
– Scalability
Customized Needs
How is it different?
– Runs on commodity hardware, where failure is the expectation rather than the exception (PC vs Mac, anyone?)
– Huge files (on the order of multiple GBs)
– Writes mostly append data, unlike traditional systems
– The applications that use the system are in-house!
– The files stored are primarily web documents
File System Hierarchy
[Diagram: a directory contains files (File 1 … File n); each file is made up of chunks (Chunk0 … Chunk5) identified by 64-bit globally unique ids; the master server holds the mappings, and the chunks themselves live on chunk servers.]
Types of servers
The master server holds all the metadata information, such as
– Directory => file mapping
– File => chunk mapping
– Chunk locations
It keeps in touch with the chunk servers via heartbeat messages
Chunk servers store the actual chunks on local disks as Linux files
For reliability, chunks may be replicated across multiple chunk servers
Read Operation
Using the fixed chunk size & the user-provided filename & byte offset, the client computes a chunk index (sketched below)
The filename & chunk index are then sent to the master to get the chunk location & the replica locations
The client caches this info (for a limited time) using the filename & chunk index as the key
The client communicates directly with the closest chunk server
To minimize client-master interaction, the client batches chunk-location requests & the master also returns the locations of the chunks following the requested ones
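The index computation follows directly from the fixed chunk size; the filename and cache contents here are hypothetical examples.

    CHUNK_SIZE = 64 * 2**20  # fixed 64 MB chunks

    def chunk_index(byte_offset):
        # With a fixed chunk size, the index is just integer division.
        return byte_offset // CHUNK_SIZE

    # (filename, chunk index) is what the client sends to the master and
    # also the key under which it caches the replica locations.
    cache = {}
    key = ("/logs/web-00", chunk_index(200 * 2**20))
    cache[key] = ["chunkserver-a", "chunkserver-b", "chunkserver-c"]
    print(key)  # ('/logs/web-00', 3)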
Write Operation
Write Operation
1. The client requests a chunk from the master
2. The master assigns a chunk lease (60 seconds, renewable) to a primary among the replicas
3. The client then pushes the data to be written to the nearest chunk server
   – Each chunk server in turn pushes this data to the next nearest server
   – This ensures that the network bandwidth is fully utilized
4. Once all replicas have the data, the client pushes the write request to the primary
   – The primary determines the order of mutations based on the multiple requests it receives from one or more clients
Write Operation
5. The primary then pushes this ordering information to all replicas
6. The replicas acknowledge the primary once the mutations have been applied successfully
7. The primary then acknowledges the client
Data flow is decoupled from control flow so that the network topology, not the choice of primary, dictates the throughput
The distance between two nodes is estimated from their IP addresses
The use of a switched network with full-duplex links allows servers to forward data as soon as they start receiving it (sketched below)
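A sketch of ordering the data push as a nearest-neighbor chain; the numeric node ids and toy distance function stand in for the IP-address-based distance estimate.

    def push_pipeline(client, replicas, distance):
        """Chain the push: each node forwards to the nearest replica
        that has not yet received the data."""
        chain, current, remaining = [], client, set(replicas)
        while remaining:
            nxt = min(remaining, key=lambda r: distance(current, r))
            chain.append(nxt)
            remaining.discard(nxt)
            current = nxt
        return chain

    # Toy distance: absolute difference of numeric node ids.
    dist = lambda a, b: abs(a - b)
    print(push_pipeline(client=0, replicas=[7, 3, 12], distance=dist))  # [3, 7, 12]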
Record Append Operations
Appends data to a file at least once atomically and returns the offset back to the client
1. The client pushes the data to all replicas
2. It then sends the request to the primary
3. The primary checks whether the chunk size would be exceeded (sketched below)
   – If so, it pads the remaining space of the old chunk, creates a new chunk, instructs the replicas to do the same, and asks the client to retry on the new chunk
   – Else, it writes to the chunk and instructs the replicas to do the same
4. If the append fails at any replica, the client retries the operation
The single most commonly used operation by distributed applications at Google for writing concurrently to a file
It allows simple coordination schemes rather than the complex distributed locking mechanisms used for traditional writes
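A primary-side sketch of step 3; the chunk is modeled as a plain dict, and the sizes are toy values rather than GFS's real record format.

    CHUNK_SIZE = 64 * 2**20  # 64 MB

    def record_append(chunk, record):
        """Append at the primary's chosen offset, or pad and redirect."""
        if chunk["used"] + len(record) > CHUNK_SIZE:
            chunk["used"] = CHUNK_SIZE       # pad the rest of the old chunk
            return None                      # client must retry on a new chunk
        offset = chunk["used"]
        chunk["data"].append((offset, record))  # replicas apply the same order
        chunk["used"] += len(record)
        return offset                        # returned to the client

    chunk = {"used": 0, "data": []}
    print(record_append(chunk, b"event-1"))  # 0
    print(record_append(chunk, b"event-2"))  # 7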
Snapshot Operations
Used by applications to checkpoint their progress
Creates an instant copy of a file or directory tree while minimizing interruptions to ongoing mutations
The master revokes any outstanding leases on the chunks
The master duplicates the metadata, which continues to point to the same chunks
Upon the first write request, the master asks the chunk server to replicate the chunk
The new chunk is created on the same chunk server, thereby avoiding network traffic
Replication & Rebalancing
Chunks are replicated both across racks and within racks
This not only boosts availability, reliability, etc., but also exploits the aggregate bandwidth for reads
The placement of chunks (balancing) depends on several factors
– Evening out disk utilization across servers
– Limiting the number of recent creations on a chunk server
– Spreading replicas across racks
The number of replicas is configurable, and the master ensures that it doesn't fall below the threshold
Replication & Rebalancing
The priority with which chunks are re-replicated is assigned by the master based on various factors, such as
– Distance from the replication threshold
– Live chunks over deleted chunks
– Chunks blocking the progress of clients
The master as well as the chunk servers throttle cloning operations to ensure that they do not interfere with regular operations
The master also performs periodic rebalancing for better load balancing & disk-space utilization
Garbage Collection
A file deletion is logged by the master
The file is not reclaimed immediately; it is renamed to a hidden name carrying the deletion timestamp
The master reclaims these files during its periodic scan if they are older than 3 days (sketched below)
When a file is reclaimed, its in-memory metadata is erased, thus severing its links to its chunks
In a similar scan of the chunk namespace, the master identifies orphaned chunks & erases their metadata
The chunks are then reclaimed by the chunk servers upon confirmation during the regular heartbeat messages
Stale replicas are also collected, using version numbers
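A sketch of the rename-then-scan scheme, assuming a hypothetical hidden-name format that embeds the deletion timestamp:

    import time

    THREE_DAYS = 3 * 24 * 3600
    namespace = {}  # path -> file metadata

    def delete_file(path):
        # Deletion is just a rename to a hidden name with a timestamp.
        hidden = f".deleted.{int(time.time())}.{path}"
        namespace[hidden] = namespace.pop(path)

    def periodic_scan(now):
        # Reclaim hidden files older than three days; erasing the metadata
        # severs the link to the chunks, which the chunk servers later
        # discard as orphans during heartbeat exchanges.
        for name in list(namespace):
            if name.startswith(".deleted."):
                stamp = int(name.split(".")[2])
                if now - stamp > THREE_DAYS:
                    del namespace[name]

    namespace["/logs/a"] = {"chunks": [101, 102]}
    delete_file("/logs/a")
    periodic_scan(now=time.time() + THREE_DAYS + 1)
    print(namespace)  # {}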
Stale Replica Detection
Each chunk is associated with a version number, maintained by both the master and the chunk server
The version number is incremented whenever a new lease is granted
If the chunk server's version lags behind the master's, the chunk is marked for GC
If the master's version lags behind the chunk server's, the master updates itself
This version number is also included in all communications, so that the client/chunk server can verify it before performing any operation (sketched below)
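The reconciliation rule, written out as a small function:

    def check_replica(master_version, replica_version):
        """Compare version numbers the way the slide describes."""
        if replica_version < master_version:
            return "stale replica: mark for GC"
        if replica_version > master_version:
            return "master behind: adopt the replica's version"
        return "up to date"

    print(check_replica(master_version=5, replica_version=4))
    print(check_replica(master_version=4, replica_version=5))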
Fault Tolerance
Both the master and the chunk servers are designed to restore their state and start in seconds
Replication of chunks across racks and within racks ensures high availability
Monitoring infrastructure outside GFS watches for master failure and starts a new master process on one of the replicated master servers
“Shadow” masters provide read-only access even when the primary master is down
Fault Tolerance
A shadow master periodically applies the growing primary-master log to itself to keep up to date
It also periodically exchanges heartbeat messages with the chunk servers to locate replicas
The integrity of the data is maintained through checksums on the servers
This verification is done during any read, write, or chunk-migration request & also periodically
Benchmark
Measurements & Results
Key Design Parameters
The choice of chunk size (64 MB), combined with the nature of reads/writes, offers several advantages:
– Reduces client-master interaction
– Makes many operations on the same chunk more likely
– Reduces the size of the metadata (it can be held in primary memory)
But hotspots can develop when many clients request the same chunk
This can be suppressed with replication, staggered application start-ups, P2P schemes, etc.
Key Design Parameters
Chunk location information is not persistent; it is collected via heartbeat messages and stored in main memory
This eliminates the need to keep the master in sync whenever chunk servers join or leave the cluster
Also, given the large chunk size, the amount of metadata to be stored in memory is greatly reduced (see the arithmetic below)
This small size also allows periodic scanning of the metadata for garbage collection, re-replication & chunk migration without incurring much overhead
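Back-of-the-envelope arithmetic for that claim, assuming the figure the GFS paper cites of under 64 bytes of master metadata per 64 MB chunk:

    PB = 2**50
    CHUNK_SIZE = 64 * 2**20
    META_PER_CHUNK = 64  # bytes; an upper-bound assumption

    data = 1 * PB
    chunks = data // CHUNK_SIZE           # 2**24 = ~16.8 million chunks
    meta_bytes = chunks * META_PER_CHUNK  # 2**30 bytes = 1 GiB
    print(chunks, meta_bytes / 2**30)     # easily held in main memory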
Key Design Parameters
The operation log maintains the transactional record of metadata changes in GFS
It employs checkpointing to keep the log size & recovery time low
The logs are replicated across multiple servers to ensure reliability
Any response to a client is provided only after the log has been flushed to all of these replicas (sketched below)
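A sketch of that flush-before-acknowledge discipline; the classes are stand-ins, not GFS's actual interfaces:

    class OperationLog:
        def __init__(self, replicas):
            self.records, self.replicas = [], replicas

        def commit(self, record):
            self.records.append(record)
            for replica in self.replicas:  # flush to every log replica first
                replica.append(record)
            return "ack"                   # only now respond to the client

    log = OperationLog([[], []])           # two replica logs as plain lists
    print(log.commit(("create", "/logs/a")))  # ack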