mariposa the google file system

04/21/23 EECS 584, Fall 2011 1

MariposaThe Google File System

Haowei Lu

Madhusudhanan Palani

04/21/23 EECS 584, Fall 2011 2

From LAN to WAN

Drawbacks of traditional distributed DBMS– Static Data Allocation

• Move objects manually

– Single Administrative Structure• Cost-based optimizer cannot scale well

– Uniformity• Different Machine Architecture

• Different Data Type

04/21/23 EECS 584, Fall 2011 3

From LAN to WAN

New requirements– Scalability to a large number of

cooperating sites:– Data mobility– No global synchronization– Total local autonomy– Easily configurable policies

04/21/23 EECS 584, Fall 2011 4

From LAN to WAN

Solution – A distributed microeconomic approach– Well studied economic model– Reduce scheduling complexity(?!)– Invisible hands for local optimum

04/21/23 EECS 584, Fall 2011 5

Mariposa

Let each site acts on its own behalf to maximize his own profit

In turn, it brings the overall performance of the DBMS ecosystem

04/21/23 EECS 584, Fall 2011 6

Architecture - Glossary

Fragment – Units of storage that are bought and sold by sites– Range distribution– Hash-Based distribution– Unstructured! Whenever site wants!

Stride– Operations that can proceed in parallel

04/21/23 EECS 584, Fall 2011 7

Architecture

04/21/23 EECS 584, Fall 2011 8

04/21/23 EECS 584, Fall 2011 9

The bidding process

04/21/23 EECS 584, Fall 2011 10

The bidding process The Broker: Send out requests for bid for query

plan The Bidder: Responds to the request for bid with

its formulated price and other information in the form:– (C,D,E) Cost, Delay, Expiration Date

The whole logic is implemented using RUSH– A low level, very efficient embedded scripting

language and rule system– Form: on <condition> do <action>

04/21/23 EECS 584, Fall 2011 11

The bidding process: Bidder

The Bidder: Setting the price for bid– Billing rate on a per-fragment basis– Consider site load

• Actual Bid = Computed bid * Load average

– Bid referencing hot list from storage manager

04/21/23 EECS 584, Fall 2011 12

The bidding process

04/21/23 EECS 584, Fall 2011 13

The bidding process: Broker

The Broker– Input: Fragmented query plan– In process: Decide the sites to run

fragments & send out bid acceptance• Expensive bid protocol• Purchase order protocol (mainly used)

– Output: hand off task to coordinator

Expensive bid protocol

04/21/23 EECS 584, Fall 2011 14

Broker

Bidder(Individual Sites)

Ads Table(Locate at Name Server)

Bookkeeping Table for previous winner sites(Same site as broker)

Under Budge

Purchase Order Protocol

04/21/23 EECS 584, Fall 2011 15

Broker The most possible bidder

RefuseAccept

GenerateBill

Pass to anotherSiteReturn to Broker

04/21/23 EECS 584, Fall 2011 16

The bidding process: Broker The Broker finds bidder using Ad table

04/21/23 EECS 584, Fall 2011 17

The bidding process: Broker The Broker finds bidder using Ad table Example (Sale Price)

– Query-Template: SELECT * FROM TMP– Sever Id: 123– Start Time: 2011/10/01– Expiration Time: 2011/10/04– Price: 10 unit– Delay: 5 seconds

04/21/23 EECS 584, Fall 2011 18

The bidding process: Broker Type of Ads (REALY FANCY)

04/21/23 EECS 584, Fall 2011 19

The bidding process: Bid Acceptance The main idea: Make the difference large as

possible– Difference:= B(D) – C (D: Delay, C: Cost, B(t): The

budget function)

Method: Greedy Algorithm– Pre-Step: Get the least D result– Iteration Steps:

• Calculate Cost Gradient CG:= cost reduce/delay increase for each stride

• Keep substitute using MAX(CG) until no increase on difference

04/21/23 EECS 584, Fall 2011 20

The bidding process: Separate Bidder Network bidder

– One trip to get

bandwidth– Return trip to

get price– Happen at

second

04/21/23 EECS 584, Fall 2011 21

Storage Manager An asynchronous process which runs in

tandem with the bidder Objective

– Maximize revenue income per unit time Functions

– Calculate Fragment Values– Buy Fragments– Sell Fragments– Split/Coalesce Fragments

04/21/23 EECS 584, Fall 2011 22

Fragment Values The value of a fragment is defined using the revenue

history Revenue history consists of

– Query, No of records in result, time since last query, last revenue, delay, CPU & I/O used

CPU & I/O is normalized & stored in site independent units

Each site should– Convert this CPU & I/O units to site specific units via

weighting functions– Adjust revenue as the current node maybe faster or slower

by using the average bid curve

04/21/23 EECS 584, Fall 2011 23

Buying fragments In order to bid for a query/subquery the site should have

the referenced fragments The site can buy the fragments in advance (prefetch) or

when the query comes in (on demand) The buyer locates the owner of fragment and requests

revenue history Calculates the value of fragment Evict old fragments to free up space (alternate fragments)

– To the extent that space is available for new fragments

Buyer Offer price = value of fragment – value of alternate fragments + price received

04/21/23 EECS 584, Fall 2011 24

Selling Fragments Seller can evict the fragment being bought or

any other fragment(alternate) of equivalent size (Why is this a must?)

Seller will sell if – offer price > value of fragment (sell) – value of

alternate fragments + price received

If offer price is not sufficient, – then seller tries to evict fragment of higher value – lower the price of fragment as a final option

04/21/23 EECS 584, Fall 2011 25

Split & Coalesce When to Split/Coalesce?

– Split if there are too few fragments otherwise parallelization will take a hit

– Coalesce if there are too many fragments as overhead of dealing with the fragments & response time will take a hit

The algorithm for split/coalesce must strike the correct balance between the two

04/21/23 EECS 584, Fall 2011 26

How to solve this issue???An interlude

04/21/23 EECS 584, Fall 2011 27

Why not extend my microeconomics analogy!?!

Stonebreaker’s Microeconomics Idea

Market pressure should correct inappropriate fragment sizes

Large fragment size =>

Now everyone wants a share of the pie

But the owner does not want to lose

the revenue!

04/21/23 EECS 584, Fall 2011 28

The Idea Continued Break the large fragment

into smaller fragments

Smaller fragment means less revenue & less attractive for copies

04/21/23 EECS 584, Fall 2011 29

It still continues…. Smaller fragments also mean more

overhead => Works against the owner!

04/21/23 EECS 584, Fall 2011 30

And it ends…

So depending on the market demand these two opposing motivations will balance each other

04/21/23 EECS 584, Fall 2011 31

How to solve this issue???

04/21/23 EECS 584, Fall 2011 32

A more “concrete” approach !!

A more “concrete” approach... Mariposa will calculate expected delay (ED) due to

parallel execution on multiple fragments (Numc) It then computes the expected bid per site as

– B(ED)/Numc

Vary Numc to arrive at maximum revenue per site => Num*

Sites will keep track of this Num* to base their split/coalesce decision

**The sites should also ensure that the existing contracts are not affected

04/21/23 EECS 584, Fall 2011 33

Name Service Architecture

04/21/23 EECS 584, Fall 2011 34

Broker Name Service

Name server

Local sites

What are the different types of names?

Internal names: They are location dependent and carries info related to the physical location of the object

Full Names: They uniquely identify an object, are location independent & carries full info related to attributes of object

Common Names: They are user defined & defined within a name space.

Simple rules help translate common names to full names The missing components are usually derived from the

parameters supplied by the user or from the user’s environment

Name Context: This is similar to access modifiers in programming languages.

04/21/23 EECS 584, Fall 2011 35

How are names resolved? Name resolution helps discover the object

that is bound to a name– Common Name => Full Name– Full Name => Internal Name

The broker employs the following steps to resolve a name– Searches local cache– Rule driven search to resolve ambiguities– Query one or more name servers

04/21/23 EECS 584, Fall 2011 36

How is QOS of name Servers Defined?

Name servers helps translate common names to full names using name contexts provided by clients

Name service contacts various name servers Each name server maintains a composite set

of metadata of local sites under them It’s the role of name server to periodically

update its catalog QOS is defined as the combination of price &

staleness of this data

04/21/23 EECS 584, Fall 2011 37

Experiment

04/21/23 EECS 584, Fall 2011 38

The Query:– SELECT *

FROM R1(SB), R2(B), R3(SD)

WHERE R1.u1 = R2.u1

AND R2.u1 = R3.u1

The following statistics are available to the optimizer– R1 join R2 (1MB)– R2 join R3 (3MB)– R1 join R2 join R3 (4.5MB)

04/21/23 EECS 584, Fall 2011 39

The traditional Distributed RDBMS plans a query & sends the sub queries to the processing sites which is the same as purchase order protocol

Therefore the overhead due to Mariposa is the difference in elapsed time between the two protocols weighted by proportion of queries using the protocols

Bid price = (1.5 x estimated cost) x load average– Load average = 1

A node will sell a fragment if– Offer price > 2 X scan cost / load average

Decision to buy a fragment rather than subcontract is based on– Sale price <= Total money spent on scans

04/21/23 EECS 584, Fall 2011 40

The query optimizer chooses a plan based on the data transferred across the network

The initial plan generated by both Mariposa and the traditional systems will be similar

But due to migration of fragments subsequent executions of the same query will generate much better plans

04/21/23 EECS 584, Fall 2011 41

04/21/23 EECS 584, Fall 2011 42

04/21/23 EECS 584, Fall 2011 43

GFS - Topics Covered Motivation Architectural/File System Hierarchical

Overview Read/Write/Append/Snapshot Operation Key Design Parameters Replication & Rebalancing Garbage Collection Fault Tolerance

Motivation

Customized Needs Reliability Availability Performance Scalability

04/21/23 EECS 584, Fall 2011 44

Customized Needs How is it different?

Runs on commodity hardware where failure is not an exception rather an expectation (PC Vs Mac anyone?)

Huge Files (in the order of Multi-GBs) Writes involve only appending data unlike

traditional systems Applications that use these systems are in-

house! Files stored are primarily web documents

04/21/23 EECS 584, Fall 2011 45

04/21/23 EECS 584, Fall 2011 46

GFS - Topics Covered Motivation Architectural/File System Hierarchical

File System Hierarchy

04/21/23 EECS 584, Fall 2011 47

mariposa the google file system

bidding processeecs

bidding processthe broker

computed bid

bid acceptancethe main

bidderthe bidder

site acts

query planthe bidder

brokerthe brokerinput

Documents

2004sys3 google file shutter speed

presentacion google file system

advanced google file system

welcome to mariposa! old town historic mariposa this ... ›...

1004 kmz file on google earth from google drive

google places help file

distributed file systems (from google)

gfs - google file system presentation

google file systems

google file system - gfs

válvulas borboleta xurox - inoxpal...valvllas de mariposa...

o google file system

gfs : google file system

google doodle file history

upload file ke google drives

google file system report

gfs: google file...

print - ????? ??? ????? ???? ??? ?????? ????? - google docs...

gfs - google file system

google sites - fis psu phuket · pdf...