Mariposa / The Google File System

TRANSCRIPT

Page 1: Mariposa The Google File System

Mariposa / The Google File System

Haowei Lu

Madhusudhanan Palani

Page 2: Mariposa The Google File System

From LAN to WAN

Drawbacks of traditional distributed DBMS
– Static Data Allocation
  • Objects must be moved manually
– Single Administrative Structure
  • Cost-based optimizer cannot scale well
– Uniformity
  • Different machine architectures
  • Different data types

Page 3: Mariposa The Google File System

From LAN to WAN

New requirements
– Scalability to a large number of cooperating sites
– Data mobility
– No global synchronization
– Total local autonomy
– Easily configurable policies

Page 4: Mariposa The Google File System

From LAN to WAN

Solution – a distributed microeconomic approach
– Well-studied economic model
– Reduces scheduling complexity (?!)
– Invisible hand for a local optimum

Page 5: Mariposa The Google File System

Mariposa

Let each site act on its own behalf to maximize its own profit

In turn, this improves the overall performance of the DBMS ecosystem

Page 6: Mariposa The Google File System

Architecture - Glossary

Fragment – the unit of storage that is bought and sold by sites
– Range distribution
– Hash-based distribution
– Unstructured: split however the site wants!

Stride – a set of operations that can proceed in parallel

Page 7: Mariposa The Google File System

Architecture

Page 8: Mariposa The Google File System

Page 9: Mariposa The Google File System

The bidding process

Page 10: Mariposa The Google File System

The bidding process

The Broker: sends out requests for bids for a query plan

The Bidder: responds to a request for bid with its formulated price and other information in the form
– (C, D, E): Cost, Delay, Expiration Date

The whole logic is implemented using RUSH
– A low-level, very efficient embedded scripting language and rule system
– Form: on <condition> do <action>
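A minimal sketch of the (C, D, E) bid triple and one "on <condition> do <action>" rule, written here in Python rather than RUSH; the names and the cost/delay inputs are hypothetical, not Mariposa's actual code.

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class Bid:
        cost: float          # C: price charged for running the subquery
        delay: float         # D: promised completion time, in seconds
        expires: datetime    # E: expiration date of the offer

    # on <request for bid arrives> do <answer with a (C, D, E) triple>
    def on_request_for_bid(estimated_cost: float, estimated_delay: float) -> Bid:
        return Bid(cost=estimated_cost,
                   delay=estimated_delay,
                   expires=datetime.now() + timedelta(minutes=10))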

Page 11: Mariposa The Google File System

The bidding process: Bidder

The Bidder: setting the price for a bid
– Billing rate on a per-fragment basis
– Consider site load
  • Actual bid = computed bid * load average
– Bid references the hot list from the storage manager
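As a toy illustration of the load adjustment (hypothetical helper; the real bidder also consults per-fragment billing rates and the hot list):

    def actual_bid(computed_bid: float, load_average: float) -> float:
        """Actual bid = computed bid * load average (a busier site charges more)."""
        return computed_bid * load_average

    assert actual_bid(10.0, 1.6) == 16.0   # a 10-unit bid on a loaded site becomes 16 units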

Page 12: Mariposa The Google File System


The bidding process

Page 13: Mariposa The Google File System

The bidding process: Broker

The Broker
– Input: fragmented query plan
– In process: decide the sites to run the fragments and send out bid acceptances
  • Expensive bid protocol
  • Purchase order protocol (mainly used)
– Output: hand off the task to the coordinator

Page 14: Mariposa The Google File System

The bidding process: Broker

Expensive bid protocol (diagram)
– Broker
– Bidder (individual sites)
– Ads table (located at the name server)
– Bookkeeping table for previous winner sites (same site as the broker)
– Under budget

Page 15: Mariposa The Google File System

The bidding process: Broker

Purchase order protocol (diagram)
– The Broker sends the work directly to the single most likely bidder
– Accept: the site processes the work and generates a bill
– Refuse: the site passes the work to another site or returns it to the Broker

Page 16: Mariposa The Google File System

The bidding process: Broker

The Broker finds bidders using the Ad table

Page 17: Mariposa The Google File System

The bidding process: Broker

The Broker finds bidders using the Ad table. Example (sale price ad):
– Query template: SELECT * FROM TMP
– Server id: 123
– Start time: 2011/10/01
– Expiration time: 2011/10/04
– Price: 10 units
– Delay: 5 seconds
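A small sketch of how such an ad-table entry could be represented (field names are hypothetical, mirroring the example above):

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Ad:
        query_template: str    # query shape the site is willing to run
        server_id: int         # site advertising the price
        start_time: date       # when the advertised price becomes valid
        expiration_time: date  # when the ad expires
        price: float           # units charged for the query template
        delay: float           # promised delay, in seconds

    ad = Ad("SELECT * FROM TMP", 123, date(2011, 10, 1), date(2011, 10, 4), 10.0, 5.0)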

Page 18: Mariposa The Google File System

The bidding process: Broker

Types of ads (really fancy)

Page 19: Mariposa The Google File System

The bidding process: Bid Acceptance

The main idea: make the difference as large as possible
– Difference := B(D) - C  (D: delay, C: cost, B(t): the budget function)

Method: greedy algorithm (sketched below)
– Pre-step: start from the least-delay result
– Iteration steps:
  • Calculate the cost gradient CG := (cost reduction) / (delay increase) for each stride
  • Keep substituting using MAX(CG) until the difference no longer increases
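A compact sketch of this greedy substitution, simplified to one chosen bid per stride with total delay taken as the sum over strides; the structures are hypothetical, not Mariposa's implementation.

    def accept_bids(strides, budget):
        """strides: list of candidate-bid lists, one list of (cost, delay) per stride.
        budget: callable B(delay); the broker maximizes B(total_delay) - total_cost."""
        # Pre-step: take the least-delay bid for every stride.
        chosen = [min(bids, key=lambda b: b[1]) for bids in strides]

        def difference(selection):
            return budget(sum(d for _, d in selection)) - sum(c for c, _ in selection)

        while True:
            # Cost gradient of swapping stride i to a cheaper, slower bid.
            candidates = []
            for i, bids in enumerate(strides):
                cur_cost, cur_delay = chosen[i]
                for cost, delay in bids:
                    if cost < cur_cost and delay > cur_delay:
                        cg = (cur_cost - cost) / (delay - cur_delay)
                        candidates.append((cg, i, (cost, delay)))
            if not candidates:
                break
            _, i, bid = max(candidates)            # substitute using MAX(CG) ...
            trial = list(chosen)
            trial[i] = bid
            if difference(trial) <= difference(chosen):
                break                              # ... until the difference stops increasing
            chosen = trial
        return chosen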

Page 20: Mariposa The Google File System

The bidding process: Separate Bidder

Network bidder
– One trip to get bandwidth
– Return trip to get the price
– Happens at the second stage

Page 21: Mariposa The Google File System


Page 22: Mariposa The Google File System

Storage Manager

An asynchronous process that runs in tandem with the bidder

Objective
– Maximize revenue income per unit time

Functions
– Calculate fragment values
– Buy fragments
– Sell fragments
– Split/coalesce fragments

Page 23: Mariposa The Google File System

Fragment Values

The value of a fragment is defined using its revenue history

Revenue history consists of
– Query, number of records in the result, time since the last query, last revenue, delay, CPU & I/O used

CPU & I/O are normalized & stored in site-independent units

Each site should
– Convert these CPU & I/O units to site-specific units via weighting functions
– Adjust revenue, since the current node may be faster or slower, by using the average bid curve

Page 24: Mariposa The Google File System

Buying Fragments

In order to bid for a query/subquery, the site must have the referenced fragments

The site can buy fragments in advance (prefetch) or when the query comes in (on demand)

The buyer locates the owner of the fragment and requests its revenue history

It calculates the value of the fragment

It may evict old fragments (alternate fragments) to free up space
– Only to the extent that space is needed for the new fragment

Buyer offer price = value of fragment - value of alternate fragments + price received (see the sketch below)
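A one-line sketch of the buyer's offer price (hypothetical function name; "price received" is what the buyer would get for selling the evicted alternates):

    def buyer_offer_price(fragment_value: float,
                          alternate_values: float,
                          price_received_for_alternates: float) -> float:
        """Offer price = value of fragment - value of evicted alternates + price received."""
        return fragment_value - alternate_values + price_received_for_alternates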

Page 25: Mariposa The Google File System

Selling Fragments

The seller can evict the fragment being bought or any other fragment (alternate) of equivalent size (why is this a must?)

The seller will sell if
– offer price > value of fragment (sell) - value of alternate fragments + price received

If the offer price is not sufficient,
– the seller tries to evict a fragment of higher value
– lowering the price of the fragment is the final option
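A sketch of this sell decision, mirroring the buyer's formula above (hypothetical names; the fallback steps are noted as comments):

    def will_sell(offer_price: float,
                  fragment_value: float,
                  alternate_values: float,
                  price_received_for_alternates: float) -> bool:
        """Sell if offer price > value of fragment - value of alternates + price received."""
        return offer_price > fragment_value - alternate_values + price_received_for_alternates

    # If the offer price is not sufficient, the seller first tries to evict a fragment of
    # higher value, and lowers the advertised price of the fragment only as a final option.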

Page 26: Mariposa The Google File System

Split & Coalesce

When to split/coalesce?
– Split if there are too few fragments, otherwise parallelization takes a hit
– Coalesce if there are too many fragments, as the overhead of dealing with them and the response time take a hit

The split/coalesce algorithm must strike the correct balance between the two

Page 27: Mariposa The Google File System

How to solve this issue??? An interlude

Why not extend my microeconomics analogy!?!

Page 28: Mariposa The Google File System

Stonebraker's Microeconomics Idea

Market pressure should correct inappropriate fragment sizes

Large fragment size => now everyone wants a share of the pie

But the owner does not want to lose the revenue!

Page 29: Mariposa The Google File System

The Idea Continued

Break the large fragment into smaller fragments

A smaller fragment means less revenue and is less attractive to copy

Page 30: Mariposa The Google File System

It still continues...

Smaller fragments also mean more overhead => works against the owner!

Page 31: Mariposa The Google File System

And it ends…

So, depending on market demand, these two opposing motivations balance each other

Page 32: Mariposa The Google File System

How to solve this issue???

A more “concrete” approach!!

Page 33: Mariposa The Google File System

A more “concrete” approach...

Mariposa calculates the expected delay (ED) due to parallel execution on multiple fragments (Numc)

It then computes the expected bid per site as
– B(ED) / Numc

Vary Numc to arrive at the maximum revenue per site => Num*

Sites keep track of this Num* to base their split/coalesce decisions on (see the sketch below)

**The sites should also ensure that existing contracts are not affected
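A toy sketch of picking Num* by maximizing B(ED)/Numc; the budget and expected-delay curves below are made-up examples, not Mariposa's bookkeeping.

    def optimal_fragment_count(budget, expected_delay, max_fragments=64):
        """Return the Numc in [1, max_fragments] that maximizes B(ED(Numc)) / Numc."""
        best_num, best_revenue = 1, float("-inf")
        for numc in range(1, max_fragments + 1):
            revenue_per_site = budget(expected_delay(numc)) / numc
            if revenue_per_site > best_revenue:
                best_num, best_revenue = numc, revenue_per_site
        return best_num

    # Budget decays with delay; delay shrinks with parallelism but pays a per-fragment overhead.
    num_star = optimal_fragment_count(lambda t: max(0.0, 200 - 3 * t),
                                      lambda n: 120 / n + n)   # -> 4 with these toy curves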

Page 34: Mariposa The Google File System

Name Service Architecture

(Diagram) Broker => Name service => Name servers => Local sites

Page 35: Mariposa The Google File System

What are the different types of names?

Internal names: location dependent; they carry info about the physical location of the object

Full names: uniquely identify an object, are location independent & carry full info about the object's attributes

Common names: user defined & defined within a name space
– Simple rules help translate common names to full names
– The missing components are usually derived from parameters supplied by the user or from the user's environment

Name context: similar to access modifiers in programming languages

Page 36: Mariposa The Google File System

How are names resolved?

Name resolution discovers the object that is bound to a name
– Common name => full name
– Full name => internal name

The broker employs the following steps to resolve a name (sketched below)
– Search the local cache
– Rule-driven search to resolve ambiguities
– Query one or more name servers
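A sketch of those three steps; the cache, rule, and name-server interfaces here are hypothetical stand-ins.

    def resolve(common_name, cache, rules, name_servers, user_context):
        """Resolve a common name to an internal name via the broker's three steps."""
        # Step 1: search the local cache (keyed by the common name here, for simplicity).
        if common_name in cache:
            return cache[common_name]
        # Step 2: rule-driven search: expand the common name into candidate full names,
        # filling in missing components from the user's context to resolve ambiguities.
        candidates = [full for rule in rules for full in rule(common_name, user_context)]
        # Step 3: query one or more name servers until one binds a candidate full name.
        for full_name in candidates:
            for server in name_servers:
                internal = server.lookup(full_name)
                if internal is not None:
                    cache[common_name] = internal
                    return internal
        raise KeyError(f"unresolved name: {common_name}")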

Page 37: Mariposa The Google File System

How is the QOS of name servers defined?

Name servers help translate common names to full names using name contexts provided by clients

The name service contacts various name servers

Each name server maintains a composite set of metadata for the local sites under it

It is the name server's role to periodically update its catalog

QOS is defined as the combination of the price & the staleness of this data

Page 38: Mariposa The Google File System

Experiment


Page 39: Mariposa The Google File System

The Query:

  SELECT *
  FROM R1(SB), R2(B), R3(SD)
  WHERE R1.u1 = R2.u1
    AND R2.u1 = R3.u1

The following statistics are available to the optimizer
– R1 join R2 (1 MB)
– R2 join R3 (3 MB)
– R1 join R2 join R3 (4.5 MB)

Page 40: Mariposa The Google File System

A traditional distributed RDBMS plans a query and sends the subqueries to the processing sites, which is the same as the purchase order protocol

Therefore the overhead due to Mariposa is the difference in elapsed time between the two protocols, weighted by the proportion of queries using each protocol

Bid price = (1.5 x estimated cost) x load average
– Load average = 1

A node will sell a fragment if
– offer price > 2 x scan cost / load average

The decision to buy a fragment rather than subcontract is based on
– sale price <= total money spent on scans

(These rules are sketched below.)
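The experiment's pricing rules, written out directly (hypothetical helper names):

    def bid_price(estimated_cost: float, load_average: float = 1.0) -> float:
        """Bid price = (1.5 x estimated cost) x load average; the experiment fixes load at 1."""
        return 1.5 * estimated_cost * load_average

    def will_sell_fragment(offer_price: float, scan_cost: float, load_average: float = 1.0) -> bool:
        """A node sells a fragment if offer price > 2 x scan cost / load average."""
        return offer_price > 2 * scan_cost / load_average

    def buy_rather_than_subcontract(sale_price: float, money_spent_on_scans: float) -> bool:
        """Buy the fragment outright once it would pay for itself in scan fees."""
        return sale_price <= money_spent_on_scans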

Page 41: Mariposa The Google File System

The query optimizer chooses a plan based on the data transferred across the network

The initial plan generated by both Mariposa and the traditional systems will be similar

But because fragments migrate, subsequent executions of the same query generate much better plans

Page 42: Mariposa The Google File System

Page 43: Mariposa The Google File System

GFS - Topics Covered
– Motivation
– Architectural / File System Hierarchical Overview
– Read/Write/Append/Snapshot Operations
– Key Design Parameters
– Replication & Rebalancing
– Garbage Collection
– Fault Tolerance

Page 44: Mariposa The Google File System

Motivation

– Customized needs
– Reliability
– Availability
– Performance
– Scalability

Page 45: Mariposa The Google File System

Customized Needs - How is it different?
– Runs on commodity hardware, where failure is the expectation rather than the exception (PC vs Mac anyone?)
– Huge files (on the order of multiple GBs)
– Writes mostly involve appending data, unlike traditional systems
– The applications that use the system are in-house!
– Files stored are primarily web documents

Page 46: Mariposa The Google File System

GFS - Topics Covered
– Motivation
– Architectural / File System Hierarchical Overview
– Read/Write/Append/Snapshot Operations
– Key Design Parameters
– Replication & Rebalancing
– Garbage Collection
– Fault Tolerance

Page 47: Mariposa The Google File System

File System Hierarchy

(Diagram) Directories contain files (File 1 ... File n); each file is split into chunks (Chunk0, Chunk1, ...) identified by 64-bit globally unique ids; the master server holds this hierarchy, while chunk servers store the chunks

Page 48: Mariposa The Google File System

Types of servers

The master server holds all metadata, such as
– Directory => file mapping
– File => chunk mapping
– Chunk locations

It keeps in touch with the chunk servers via heartbeat messages

Chunk servers store the actual chunks on local disks as Linux files

For reliability, chunks may be replicated across multiple chunk servers (see the sketch below)
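A minimal sketch of the three mappings the master keeps in memory; the types and names are hypothetical, not GFS's actual data structures.

    from dataclasses import dataclass, field

    @dataclass
    class MasterMetadata:
        # directory path -> file names in that directory
        directories: dict = field(default_factory=dict)
        # file name -> ordered list of 64-bit chunk handles
        file_chunks: dict = field(default_factory=dict)
        # chunk handle -> chunk-server addresses currently holding a replica
        chunk_locations: dict = field(default_factory=dict)

    meta = MasterMetadata()
    meta.directories["/logs"] = ["crawl-00"]
    meta.file_chunks["/logs/crawl-00"] = [0x1A2B3C4D5E6F7788]
    meta.chunk_locations[0x1A2B3C4D5E6F7788] = ["chunkserver-1", "chunkserver-7"]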

Page 49: Mariposa The Google File System

GFS - Topics Covered
– Motivation
– Architectural / File System Hierarchical Overview
– Read/Write/Append/Snapshot Operations
– Key Design Parameters
– Replication & Rebalancing
– Garbage Collection
– Fault Tolerance

Page 50: Mariposa The Google File System


Page 51: Mariposa The Google File System

Read Operation

Using the fixed chunk size and the user-provided filename & byte offset, the client translates the offset into a chunk index

The filename & chunk index are then sent to the master to get the chunk location and the replica locations

The client caches this info (for a limited time) using the filename & chunk index as the key

The client then communicates directly with the closest chunk server

To minimize client-master interaction, the client batches chunk-location requests, and the master also returns locations for chunks adjacent to the requested ones (sketch below)
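A sketch of how a client turns (filename, byte offset) into a read; the master and replica objects here are hypothetical stand-ins for GFS's client library.

    CHUNK_SIZE = 64 * 1024 * 1024   # fixed 64 MB chunks

    def chunk_index(byte_offset: int) -> int:
        """The chunk index is simply the byte offset divided by the chunk size."""
        return byte_offset // CHUNK_SIZE

    def read(filename: str, offset: int, length: int, master, cache: dict) -> bytes:
        idx = chunk_index(offset)
        key = (filename, idx)
        if key not in cache:                           # contact the master only on a miss
            cache[key] = master.lookup(filename, idx)  # -> (chunk handle, replica locations)
        handle, replicas = cache[key]
        closest = min(replicas, key=lambda r: r.network_distance)
        return closest.read(handle, offset % CHUNK_SIZE, length)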

Page 52: Mariposa The Google File System

Write Operation


Page 53: Mariposa The Google File System

Write Operation
1. The client requests a chunk from the master
2. The master assigns a chunk lease (60 seconds, renewable) to a primary among the replicas
3. The client then pushes the data to be written to the nearest chunk server
   1. Each chunk server in turn pushes this data to the next nearest server
   2. This ensures that the network bandwidth is fully utilized
4. Once all replicas have the data, the client sends the write request to the primary
   1. The primary determines the order of mutations based on the multiple requests it receives from a single client or multiple clients

Page 54: Mariposa The Google File System

Write Operation
5. The primary then pushes this ordering information to all replicas
6. The replicas acknowledge the primary once the mutations have been successfully applied
7. The primary then acknowledges the client

Data flow is decoupled from control flow so that the network topology, not the choice of primary, dictates throughput

Distance between two nodes is estimated from IP addresses

Using a switched network with full-duplex links lets servers forward data as soon as they start receiving it (see the sketch below)
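A sketch of the decoupled data flow: data is pipelined along a chain of chunk servers, each forwarding to its nearest remaining peer. The client/replica objects and their methods are hypothetical.

    def push_data(data: bytes, client, replicas) -> None:
        """Pipeline the data push: client -> nearest replica -> next nearest -> ..."""
        remaining = list(replicas)
        sender = client
        while remaining:
            nxt = min(remaining, key=lambda r: sender.distance_to(r))  # IP-based distance
            sender.send(nxt, data)    # with full-duplex links, nxt can start forwarding
            remaining.remove(nxt)     # before it has received the whole payload
            sender = nxt

    # Control flow is separate: only after every replica holds the data does the client
    # send the write request to the primary, which then assigns the mutation order.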

Page 55: Mariposa The Google File System

Record Append Operation

Appends data to a file at least once atomically and returns the offset to the client
1. The client pushes the data to all replicas
2. It sends the append request to the primary
3. The primary checks whether the chunk size would be exceeded
   1. If so, it pads the remaining space of the old chunk, creates a new chunk, instructs the replicas to do the same, and asks the client to retry with the new chunk
   2. Otherwise it writes to the chunk and instructs the replicas to do so
4. If an append fails at any replica, the client retries the operation

This is the single most commonly used operation by Google's distributed applications for writing concurrently to a file

It allows simple coordination schemes rather than the complex distributed locking mechanisms used in traditional writes (sketched below)
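A sketch of the primary's decision in step 3; the chunk object and its methods are hypothetical.

    CHUNK_SIZE = 64 * 1024 * 1024

    def primary_record_append(chunk, record: bytes):
        """Return ('retry', None) if the record does not fit, else ('ok', offset)."""
        if chunk.used + len(record) > CHUNK_SIZE:
            chunk.pad_to_end()          # pad the old chunk; replicas are told to do the same
            return "retry", None        # the client retries on a fresh chunk
        offset = chunk.used
        chunk.write(offset, record)     # apply locally, then instruct the replicas
        return "ok", offset             # at-least-once append; offset goes back to the client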

Page 56: Mariposa The Google File System

Snapshot Operation

Used by applications to checkpoint their progress

Creates an instant copy of a file or directory tree while minimizing interruptions to ongoing mutations
– The master revokes any outstanding leases on the chunks
– The master duplicates the metadata, which continues to point to the same chunks
– Upon the first write request, the master asks the chunk server to replicate the chunk
– The new chunk is created on the same chunk server, thereby avoiding network traffic

Page 57: Mariposa The Google File System

GFS - Topics Covered
– Motivation
– Architectural / File System Hierarchical Overview
– Read/Write/Append/Snapshot Operations
– Key Design Parameters
– Replication & Rebalancing
– Garbage Collection
– Fault Tolerance

Page 58: Mariposa The Google File System

Replication & Rebalancing

Chunks are replicated both across racks and within racks

This not only boosts availability and reliability but also exploits aggregate bandwidth for reads

Placement of chunks (balancing) depends on several factors
– Even out disk utilization across servers
– Limit the number of recent creations on each chunk server
– Spread replicas across racks

The number of replicas is configurable, and the master ensures it does not drop below the threshold

Page 59: Mariposa The Google File System

Replication & Rebalancing

The master assigns the priority for re-replicating chunks based on factors such as
– Distance from the replication threshold
– Live chunks over deleted chunks
– Chunks blocking client progress

Both the master and the chunk servers throttle cloning operations so they do not interfere with regular operations

The master also performs periodic rebalancing for better load balancing & disk-space utilization

Page 60: Mariposa The Google File System

GFS - Topics Covered
– Motivation
– Architectural / File System Hierarchical Overview
– Read/Write/Append/Snapshot Operations
– Key Design Parameters
– Replication & Rebalancing
– Garbage Collection
– Fault Tolerance

Page 61: Mariposa The Google File System

Garbage Collection

A file deletion is logged by the master

The file is not reclaimed immediately; it is renamed to a hidden name with a deletion timestamp

The master reclaims these files during its periodic scan if they are older than 3 days

While reclaiming, the in-memory metadata is erased, severing the file's link to its chunks

In a similar scan of the chunk space, the master identifies orphaned chunks and erases their metadata

The chunks are reclaimed by the chunkservers upon confirmation during regular heartbeat messages

Stale replicas are also collected using version numbers

Page 62: Mariposa The Google File System

Stale Replica Detection

Each chunk is associated with a version number, maintained by both the master and the chunk server

The version number is incremented whenever a new lease is granted

If the chunk server's version lags behind the master's version, the chunk is marked for GC

If the master's version lags behind the chunk server's, the master is updated

The version number is also included in all communications, so the client/chunk server can verify it before performing any operation (see the sketch below)
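The comparison itself is tiny; a sketch (hypothetical helper, returning a description of the action taken):

    def check_replica(master_version: int, chunkserver_version: int) -> str:
        if chunkserver_version < master_version:
            return "stale replica: mark it for garbage collection"
        if chunkserver_version > master_version:
            return "master is behind: update the master's version number"
        return "up to date: safe to serve the request"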

Page 63: Mariposa The Google File System

GFS - Topics Covered
– Motivation
– Architectural / File System Hierarchical Overview
– Read/Write/Append/Snapshot Operations
– Key Design Parameters
– Replication & Rebalancing
– Garbage Collection
– Fault Tolerance

Page 64: Mariposa The Google File System

Fault Tolerance

Both the master and the chunk servers are designed to restore their state and restart in seconds

Replication of chunks across racks and within racks ensures high availability

Monitoring infrastructure outside GFS detects master failure and starts a new master process on one of the replicated master servers

“Shadow” masters provide read-only access even when the primary master is down

Page 65: Mariposa The Google File System

Fault Tolerance

A shadow master periodically applies the growing primary-master log to itself to keep up to date

It also periodically exchanges heartbeat messages with chunk servers to locate replicas

Data integrity is maintained through checksums kept at the chunk servers

This verification is done during any read, write, or chunk-migration request, and also periodically

Page 66: Mariposa The Google File System

Benchmark


Page 67: Mariposa The Google File System

Measurements & Results


Page 68: Mariposa The Google File System

Page 69: Mariposa The Google File System

Page 70: Mariposa The Google File System

Page 71: Mariposa The Google File System

Key Design Parameters

The choice of chunk size (64 MB), combined with the nature of reads/writes, offers several advantages:
– Reduces client-master interaction
– Makes many operations on the same chunk more likely
– Reduces the size of the metadata (it can be held in primary memory)

But hotspots can develop when many clients request the same chunk

This can be suppressed with replication, staggered application start-ups, P2P transfer, etc.

Page 72: Mariposa The Google File System

Key Design Parameters

Chunk-location metadata is not persistent; it is collected via heartbeat messages and stored in main memory

This eliminates the need to keep the master in sync whenever chunk servers join or leave the cluster

Also, given the large chunk size, the metadata to be stored in memory is greatly reduced

This small size also allows periodic scanning of the metadata for garbage collection, re-replication & chunk migration without incurring much overhead

Page 73: Mariposa The Google File System

Key Design Parameters

The operation log maintains the transactional information in GFS

It employs checkpointing to keep the log size & recovery time low

The log is replicated on multiple servers to ensure reliability

Responses to clients are provided only after the log records are flushed to all these replicas