Mariposa & The Google File System
Haowei Lu
Madhusudhanan Palani
From LAN to WAN
Drawbacks of traditional distributed DBMS
– Static data allocation
  • Move objects manually
– Single administrative structure
  • Cost-based optimizer cannot scale well
– Uniformity
  • Different machine architectures
  • Different data types
From LAN to WAN
New requirements
– Scalability to a large number of cooperating sites
– Data mobility
– No global synchronization
– Total local autonomy
– Easily configurable policies
From LAN to WAN
Solution – a distributed microeconomic approach
– Well-studied economic model
– Reduces scheduling complexity (?!)
– An invisible hand drives sites toward a local optimum
Mariposa
Let each site act on its own behalf to maximize its own profit
In turn, this improves the overall performance of the DBMS ecosystem
Architecture - Glossary
Fragment – the unit of storage that is bought and sold by sites
– Range distribution
– Hash-based distribution
– Unstructured! Whenever the site wants!
Stride – a group of operations that can proceed in parallel
Architecture
The bidding process
The bidding process
The Broker: sends out requests for bids for the query plan
The Bidder: responds to a request for bid with its formulated price and other information in the form
– (C, D, E): Cost, Delay, Expiration date
The whole logic is implemented using RUSH
– A low-level, very efficient embedded scripting language and rule system
– Form: on <condition> do <action>
The bidding process: Bidder
The Bidder: setting the price for a bid
– Billing rate set on a per-fragment basis
– Considers site load
  • Actual bid = computed bid × load average (sketched below)
– Bids reference the hot list from the storage manager
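A minimal Python sketch of this pricing rule, assuming a hypothetical per-fragment billing-rate table and a caller-supplied load average; the delay model and all numbers are illustrative stand-ins, not Mariposa's actual RUSH rules.

    import time

    # Hypothetical per-fragment billing rates (revenue units per unit of work).
    BILLING_RATE = {"emp_frag_1": 2.0, "emp_frag_2": 3.5}

    def formulate_bid(fragment, estimated_work, load_average, ttl_seconds=60):
        """Return the (cost, delay, expiration) triple for a bid request."""
        computed_bid = BILLING_RATE[fragment] * estimated_work
        cost = computed_bid * load_average      # busier sites quote more
        delay = estimated_work * load_average   # assumed linear slowdown
        expiration = time.time() + ttl_seconds  # the bid is valid until then
        return (cost, delay, expiration)

    print(formulate_bid("emp_frag_1", estimated_work=10.0, load_average=1.5))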
The bidding process
The bidding process: Broker
The Broker
– Input: fragmented query plan
– In process: decides the sites to run the fragments & sends out bid acceptances
  • Expensive bid protocol
  • Purchase order protocol (mainly used)
– Output: hands the task off to the coordinator
Expensive Bid Protocol
The bidding process: Broker
[Diagram: the broker sends requests for bids to the bidders (individual sites), consulting the ads table (located at the name server) and a bookkeeping table of previous winning sites (kept at the broker's site), while staying under budget.]
Purchase Order Protocol
The bidding process: Broker
[Diagram: the broker sends the purchase order to the most probable bidder, which either accepts and generates a bill, or refuses and passes the order to another site or returns it to the broker.]
The bidding process: Broker
The Broker finds bidders using the ads table
The bidding process: Broker
The Broker finds bidders using the ads table
Example (sale price ad)
– Query template: SELECT * FROM TMP
– Server id: 123
– Start time: 2011/10/01
– Expiration time: 2011/10/04
– Price: 10 units
– Delay: 5 seconds
The bidding process: Broker
Types of ads (REALLY FANCY)
The bidding process: Bid Acceptance
The main idea: make the difference as large as possible
– Difference := B(D) – C (D: delay, C: cost, B(t): the budget function)
Method: greedy algorithm (sketched below)
– Pre-step: start from the least-delay result
– Iteration steps:
  • Calculate the cost gradient CG := (cost reduction) / (delay increase) for each stride
  • Keep substituting using max(CG) until the difference no longer increases
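A runnable sketch of this greedy substitution, under stated assumptions: strides execute one after another (so delays add), each stride has a set of (cost, delay) bids, and the budget function is a toy linear one. An illustration of the idea, not Mariposa's implementation.

    def greedy_accept(stride_bids, budget):
        """Pick one bid per stride, greedily maximizing B(D) - C."""
        # Pre-step: take the least-delay bid for every stride.
        choice = [min(bids, key=lambda b: b[1]) for bids in stride_bids]

        def difference(ch):
            return budget(sum(d for _, d in ch)) - sum(c for c, _ in ch)

        while True:
            best = None  # (cost gradient, stride index, candidate bid)
            for i, bids in enumerate(stride_bids):
                cur_cost, cur_delay = choice[i]
                for cost, delay in bids:
                    if cost < cur_cost and delay > cur_delay:
                        cg = (cur_cost - cost) / (delay - cur_delay)
                        if best is None or cg > best[0]:
                            best = (cg, i, (cost, delay))
            if best is None:
                break
            _, i, candidate = best
            trial = choice[:i] + [candidate] + choice[i + 1:]
            if difference(trial) <= difference(choice):
                break  # no further increase in B(D) - C: stop
            choice = trial
        return choice

    # Two strides, each offering a fast/expensive and a slow/cheap bid.
    print(greedy_accept([[(10, 1), (6, 3)], [(8, 2), (5, 5)]],
                        budget=lambda t: 30 - t))  # -> [(6, 3), (8, 2)]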
The bidding process: Separate Bidder
Network bidder
– One trip to get bandwidth
– Return trip to get the price
– Happens at the second stage
Storage Manager
An asynchronous process which runs in tandem with the bidder
Objective
– Maximize revenue per unit time
Functions
– Calculate fragment values
– Buy fragments
– Sell fragments
– Split/coalesce fragments
Fragment Values
The value of a fragment is defined using its revenue history
Revenue history consists of
– Query, number of records in the result, time since the last query, last revenue, delay, CPU & I/O used
CPU & I/O are normalized & stored in site-independent units
Each site should
– Convert these CPU & I/O units to site-specific units via weighting functions
– Adjust revenue, since the current node may be faster or slower, by using the average bid curve
Buying Fragments
In order to bid for a query/subquery, the site must hold the referenced fragments
The site can buy fragments in advance (prefetch) or when the query comes in (on demand)
The buyer locates the owner of the fragment and requests its revenue history
It then calculates the value of the fragment
It evicts old fragments (alternate fragments) to free up space
– Only to the extent needed to make space available for the new fragments
Buyer offer price = value of fragment – value of alternate fragments + price received for them
Selling Fragments
The seller can evict the fragment being bought or any other fragment (alternate) of equivalent size (why is this a must?)
The seller will sell if
– offer price > value of the fragment – value of the alternate fragments + price received for them
If the offer price is not sufficient
– the seller tries to evict a fragment of higher value
– lowering the price of the fragment is the final option
Both pricing rules are sketched below.
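A minimal sketch of the buyer's and seller's rules from the last two slides; all values and prices are made-up numbers for illustration.

    def buyer_offer(frag_value, alt_values, alt_sale_price):
        """Buyer's offer: value of the fragment, minus the value of the
        alternate fragments it must evict, plus the price it expects to
        receive for them."""
        return frag_value - sum(alt_values) + alt_sale_price

    def seller_accepts(offer, frag_value, alt_values, alt_sale_price):
        """Seller's rule: sell when the offer beats the value of the
        fragment, less the value of evicted alternates, plus their price."""
        return offer > frag_value - sum(alt_values) + alt_sale_price

    offer = buyer_offer(frag_value=100, alt_values=[30, 20], alt_sale_price=40)
    print(offer)                                                   # 90
    print(seller_accepts(offer, frag_value=80, alt_values=[15],
                         alt_sale_price=10))                       # True: 90 > 75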
Split & Coalesce
When to split/coalesce?
– Split if there are too few fragments; otherwise parallelization will take a hit
– Coalesce if there are too many fragments, as the overhead of dealing with the fragments & the response time will take a hit
The algorithm for split/coalesce must strike the correct balance between the two
How to solve this issue???
An interlude
Why not extend my microeconomics analogy!?!
Stonebraker's Microeconomics Idea
Market pressure should correct inappropriate fragment sizes
Large fragment size => now everyone wants a share of the pie
But the owner does not want to lose the revenue!
The Idea Continued
Break the large fragment into smaller fragments
Smaller fragments mean less revenue & are less attractive to copy
It still continues…
Smaller fragments also mean more overhead => works against the owner!
And it ends…
So, depending on market demand, these two opposing motivations will balance each other
How to solve this issue???
A more “concrete” approach!!
A more “concrete” approach...
Mariposa calculates the expected delay (ED) due to parallel execution on multiple fragments (Numc)
It then computes the expected bid per site as
– B(ED) / Numc
Vary Numc to arrive at the maximum revenue per site => Num* (sketched below)
Sites keep track of this Num* to base their split/coalesce decisions on
**The sites should also ensure that existing contracts are not affected
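A toy search for Num*, under an assumed delay model (perfect speedup plus a fixed per-fragment coordination overhead) and a linear budget function; both are illustrative stand-ins.

    def optimal_fragments(budget, base_delay, max_frags=32):
        """Vary Numc, estimate the expected delay ED, and pick the Numc
        that maximizes the expected revenue per site, B(ED) / Numc."""
        best_num, best_rev = 1, float("-inf")
        for num_c in range(1, max_frags + 1):
            ed = base_delay / num_c + 0.1 * num_c  # speedup + overhead
            revenue_per_site = budget(ed) / num_c
            if revenue_per_site > best_rev:
                best_num, best_rev = num_c, revenue_per_site
        return best_num

    # Num* = 2 for this budget curve and base delay.
    print(optimal_fragments(budget=lambda t: max(0.0, 100 - 10 * t),
                            base_delay=8.0))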
Name Service Architecture
[Diagram: the broker contacts the name service, which queries multiple name servers; each name server covers a set of local sites.]
What are the different types of names?
Internal names: location dependent; they carry info about the physical location of the object
Full names: uniquely identify an object, are location independent & carry full info about the object's attributes
Common names: user defined & defined within a name space
– Simple rules help translate common names to full names
– The missing components are usually derived from parameters supplied by the user or from the user's environment
Name context: similar to access modifiers in programming languages
How are names resolved?
Name resolution discovers the object that is bound to a name
– Common name => full name
– Full name => internal name
The broker employs the following steps to resolve a name (sketched below)
– Searches the local cache
– Rule-driven search to resolve ambiguities
– Queries one or more name servers
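A sketch of that three-step fallback; the cache contents, the completion rule, and the NameServer class are hypothetical scaffolding, not Mariposa's API.

    class NameServer:
        def __init__(self, catalog):
            self.catalog = catalog  # full name -> internal name

        def lookup(self, full_name):
            return self.catalog.get(full_name)

    local_cache = {"EMP": "site12/frag3"}  # common name -> internal name
    rules = [lambda name, user: f"mariposa.{user}.{name}"]  # complete the name

    def resolve(common_name, user, name_servers):
        if common_name in local_cache:            # 1. search the local cache
            return local_cache[common_name]
        for rule in rules:                        # 2. rule-driven search
            full_name = rule(common_name, user)
            for server in name_servers:           # 3. query the name servers
                internal = server.lookup(full_name)
                if internal is not None:
                    local_cache[common_name] = internal
                    return internal
        return None

    servers = [NameServer({"mariposa.alice.DEPT": "site42/frag7"})]
    print(resolve("DEPT", "alice", servers))  # site42/frag7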
How is the QOS of name servers defined?
Name servers help translate common names to full names using the name contexts provided by clients
The name service contacts various name servers
Each name server maintains a composite set of metadata for the local sites under it
It is the role of the name server to periodically update its catalog
QOS is defined as the combination of the price & the staleness of this data
Experiment
The Query:
SELECT *
FROM R1(SB), R2(B), R3(SD)
WHERE R1.u1 = R2.u1
AND R2.u1 = R3.u1
The following statistics are available to the optimizer
– R1 join R2 (1 MB)
– R2 join R3 (3 MB)
– R1 join R2 join R3 (4.5 MB)
A traditional distributed RDBMS plans a query & sends the subqueries to the processing sites, which is the same flow as the purchase order protocol
Therefore the overhead due to Mariposa is the difference in elapsed time between the two protocols, weighted by the proportion of queries using each protocol
Bid price = (1.5 × estimated cost) × load average
– Load average = 1
A node will sell a fragment if
– Offer price > 2 × scan cost / load average
The decision to buy a fragment rather than subcontract is based on
– Sale price <= total money spent on scans
(These three rules are sketched below.)
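The experiment's three rules, written out with illustrative numbers plugged in:

    load_average = 1.0  # as in the experiment

    def bid_price(estimated_cost):
        # Bid price = (1.5 x estimated cost) x load average
        return 1.5 * estimated_cost * load_average

    def will_sell(offer_price, scan_cost):
        # Sell a fragment if offer price > 2 x scan cost / load average
        return offer_price > 2 * scan_cost / load_average

    def buy_rather_than_subcontract(sale_price, money_spent_on_scans):
        # Buy the fragment when its sale price <= total money spent on scans
        return sale_price <= money_spent_on_scans

    print(bid_price(10.0))                          # 15.0
    print(will_sell(25.0, scan_cost=10.0))          # True: 25 > 20
    print(buy_rather_than_subcontract(12.0, 15.0))  # True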
The query optimizer chooses a plan based on the amount of data transferred across the network
The initial plans generated by Mariposa and a traditional system will be similar
But because fragments migrate, subsequent executions of the same query will generate much better plans
GFS - Topics Covered
– Motivation
– Architectural/File System Hierarchical Overview
– Read/Write/Append/Snapshot Operations
– Key Design Parameters
– Replication & Rebalancing
– Garbage Collection
– Fault Tolerance
Motivation
– Customized needs
– Reliability
– Availability
– Performance
– Scalability
Customized Needs
How is it different?
– Runs on commodity hardware, where failure is the expectation rather than the exception (PC vs Mac, anyone?)
– Huge files (on the order of multiple GBs)
– Writes mostly append data, unlike traditional systems
– The applications that use the system are in-house!
– The files stored are primarily web documents
File System Hierarchy
[Diagram: a directory contains files (File 1 … File n); each file is made up of chunks (Chunk0 … Chunk5) identified by 64-bit globally unique ids; the master server holds the mappings, and the chunks themselves live on chunk servers.]
Types of servers
The master server holds all the metadata information, such as
– Directory => file mapping
– File => chunk mapping
– Chunk locations
It keeps in touch with the chunk servers via heartbeat messages
Chunk servers store the actual chunks on local disks as Linux files
For reliability, chunks may be replicated across multiple chunk servers
Read Operation
Using the fixed chunk size & the user-provided filename & byte offset, the client computes a chunk index (sketched below)
The filename & chunk index are then sent to the master to get the chunk location & the replica locations
The client caches this info (for a limited time) using the filename & chunk index as the key
The client communicates directly with the closest chunk server
To minimize client-master interaction, the client batches chunk-location requests & the master also returns the locations of the chunks following the requested ones
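The index computation follows directly from the fixed chunk size; the filename and cache contents here are hypothetical examples.

    CHUNK_SIZE = 64 * 2**20  # fixed 64 MB chunks

    def chunk_index(byte_offset):
        # With a fixed chunk size, the index is just integer division.
        return byte_offset // CHUNK_SIZE

    # (filename, chunk index) is what the client sends to the master and
    # also the key under which it caches the replica locations.
    cache = {}
    key = ("/logs/web-00", chunk_index(200 * 2**20))
    cache[key] = ["chunkserver-a", "chunkserver-b", "chunkserver-c"]
    print(key)  # ('/logs/web-00', 3)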
Write Operation
Write Operation
1. The client requests a chunk from the master
2. The master assigns a chunk lease (60 seconds, renewable) to a primary among the replicas
3. The client then pushes the data to be written to the nearest chunk server
   – Each chunk server in turn pushes this data to the next nearest server
   – This ensures that the network bandwidth is fully utilized
4. Once all replicas have the data, the client pushes the write request to the primary
   – The primary determines the order of mutations based on the multiple requests it receives from one or more clients
Write Operation
5. The primary then pushes this ordering information to all replicas
6. The replicas acknowledge the primary once the mutations have been applied successfully
7. The primary then acknowledges the client
Data flow is decoupled from control flow so that the network topology, not the choice of primary, dictates the throughput
The distance between two nodes is estimated from their IP addresses
The use of a switched network with full-duplex links allows servers to forward data as soon as they start receiving it (sketched below)
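A sketch of ordering the data push as a nearest-neighbor chain; the numeric node ids and toy distance function stand in for the IP-address-based distance estimate.

    def push_pipeline(client, replicas, distance):
        """Chain the push: each node forwards to the nearest replica
        that has not yet received the data."""
        chain, current, remaining = [], client, set(replicas)
        while remaining:
            nxt = min(remaining, key=lambda r: distance(current, r))
            chain.append(nxt)
            remaining.discard(nxt)
            current = nxt
        return chain

    # Toy distance: absolute difference of numeric node ids.
    dist = lambda a, b: abs(a - b)
    print(push_pipeline(client=0, replicas=[7, 3, 12], distance=dist))  # [3, 7, 12]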
Record Append Operations
Appends data to a file at least once atomically and returns the offset back to the client
1. The client pushes the data to all replicas
2. It then sends the request to the primary
3. The primary checks whether the chunk size would be exceeded (sketched below)
   – If so, it pads the remaining space of the old chunk, creates a new chunk, instructs the replicas to do the same, and asks the client to retry on the new chunk
   – Else, it writes to the chunk and instructs the replicas to do the same
4. If the append fails at any replica, the client retries the operation
The single most commonly used operation by distributed applications at Google for writing concurrently to a file
It allows simple coordination schemes rather than the complex distributed locking mechanisms used for traditional writes
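A primary-side sketch of step 3; the chunk is modeled as a plain dict, and the sizes are toy values rather than GFS's real record format.

    CHUNK_SIZE = 64 * 2**20  # 64 MB

    def record_append(chunk, record):
        """Append at the primary's chosen offset, or pad and redirect."""
        if chunk["used"] + len(record) > CHUNK_SIZE:
            chunk["used"] = CHUNK_SIZE       # pad the rest of the old chunk
            return None                      # client must retry on a new chunk
        offset = chunk["used"]
        chunk["data"].append((offset, record))  # replicas apply the same order
        chunk["used"] += len(record)
        return offset                        # returned to the client

    chunk = {"used": 0, "data": []}
    print(record_append(chunk, b"event-1"))  # 0
    print(record_append(chunk, b"event-2"))  # 7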
Snapshot Operations
Used by applications to checkpoint their progress
Creates an instant copy of a file or directory tree while minimizing interruptions to ongoing mutations
The master revokes any outstanding leases on the chunks
The master duplicates the metadata, which continues to point to the same chunks
Upon the first write request, the master asks the chunk server to replicate the chunk
The new chunk is created on the same chunk server, thereby avoiding network traffic
Replication & Rebalancing
Chunks are replicated both across racks and within racks
This not only boosts availability, reliability, etc., but also exploits the aggregate bandwidth for reads
The placement of chunks (balancing) depends on several factors
– Evening out disk utilization across servers
– Limiting the number of recent creations on a chunk server
– Spreading replicas across racks
The number of replicas is configurable, and the master ensures that it doesn't fall below the threshold
Replication & Rebalancing
The priority with which chunks are re-replicated is assigned by the master based on various factors, such as
– Distance from the replication threshold
– Live chunks over deleted chunks
– Chunks blocking the progress of clients
The master as well as the chunk servers throttle cloning operations to ensure that they do not interfere with regular operations
The master also performs periodic rebalancing for better load balancing & disk-space utilization
Garbage Collection
A file deletion is logged by the master
The file is not reclaimed immediately; it is renamed to a hidden name carrying the deletion timestamp
The master reclaims these files during its periodic scan if they are older than 3 days (sketched below)
When a file is reclaimed, its in-memory metadata is erased, thus severing its links to its chunks
In a similar scan of the chunk namespace, the master identifies orphaned chunks & erases their metadata
The chunks are then reclaimed by the chunk servers upon confirmation during the regular heartbeat messages
Stale replicas are also collected, using version numbers
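A sketch of the rename-then-scan scheme, assuming a hypothetical hidden-name format that embeds the deletion timestamp:

    import time

    THREE_DAYS = 3 * 24 * 3600
    namespace = {}  # path -> file metadata

    def delete_file(path):
        # Deletion is just a rename to a hidden name with a timestamp.
        hidden = f".deleted.{int(time.time())}.{path}"
        namespace[hidden] = namespace.pop(path)

    def periodic_scan(now):
        # Reclaim hidden files older than three days; erasing the metadata
        # severs the link to the chunks, which the chunk servers later
        # discard as orphans during heartbeat exchanges.
        for name in list(namespace):
            if name.startswith(".deleted."):
                stamp = int(name.split(".")[2])
                if now - stamp > THREE_DAYS:
                    del namespace[name]

    namespace["/logs/a"] = {"chunks": [101, 102]}
    delete_file("/logs/a")
    periodic_scan(now=time.time() + THREE_DAYS + 1)
    print(namespace)  # {}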
Stale Replica Detection
Each chunk is associated with a version number, maintained by both the master and the chunk server
The version number is incremented whenever a new lease is granted
If the chunk server's version lags behind the master's, the chunk is marked for GC
If the master's version lags behind the chunk server's, the master updates itself
This version number is also included in all communications, so that the client/chunk server can verify it before performing any operation (sketched below)
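The reconciliation rule, written out as a small function:

    def check_replica(master_version, replica_version):
        """Compare version numbers the way the slide describes."""
        if replica_version < master_version:
            return "stale replica: mark for GC"
        if replica_version > master_version:
            return "master behind: adopt the replica's version"
        return "up to date"

    print(check_replica(master_version=5, replica_version=4))
    print(check_replica(master_version=4, replica_version=5))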
Fault Tolerance
Both the master and the chunk servers are designed to restore their state and start in seconds
Replication of chunks across racks and within racks ensures high availability
Monitoring infrastructure outside GFS watches for master failure and starts a new master process on one of the replicated master servers
“Shadow” masters provide read-only access even when the primary master is down
Fault Tolerance
A shadow master periodically applies the growing primary-master log to itself to keep up to date
It also periodically exchanges heartbeat messages with the chunk servers to locate replicas
The integrity of the data is maintained through checksums on the servers
This verification is done during any read, write, or chunk-migration request & also periodically
Benchmark
Measurements & Results
Key Design Parameters
The choice of chunk size (64 MB), combined with the nature of reads/writes, offers several advantages:
– Reduces client-master interaction
– Makes many operations on the same chunk more likely
– Reduces the size of the metadata (it can be held in primary memory)
But hotspots can develop when many clients request the same chunk
This can be suppressed with replication, staggered application start-ups, P2P schemes, etc.
Key Design Parameters
Chunk location information is not persistent; it is collected via heartbeat messages and stored in main memory
This eliminates the need to keep the master in sync whenever chunk servers join or leave the cluster
Also, given the large chunk size, the amount of metadata to be stored in memory is greatly reduced (see the arithmetic below)
This small size also allows periodic scanning of the metadata for garbage collection, re-replication & chunk migration without incurring much overhead
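Back-of-the-envelope arithmetic for that claim, assuming the figure the GFS paper cites of under 64 bytes of master metadata per 64 MB chunk:

    PB = 2**50
    CHUNK_SIZE = 64 * 2**20
    META_PER_CHUNK = 64  # bytes; an upper-bound assumption

    data = 1 * PB
    chunks = data // CHUNK_SIZE           # 2**24 = ~16.8 million chunks
    meta_bytes = chunks * META_PER_CHUNK  # 2**30 bytes = 1 GiB
    print(chunks, meta_bytes / 2**30)     # easily held in main memory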
Key Design Parameters
The operation log maintains the transactional record of metadata changes in GFS
It employs checkpointing to keep the log size & recovery time low
The logs are replicated across multiple servers to ensure reliability
Any response to a client is provided only after the log has been flushed to all of these replicas (sketched below)
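A sketch of that flush-before-acknowledge discipline; the classes are stand-ins, not GFS's actual interfaces:

    class OperationLog:
        def __init__(self, replicas):
            self.records, self.replicas = [], replicas

        def commit(self, record):
            self.records.append(record)
            for replica in self.replicas:  # flush to every log replica first
                replica.append(record)
            return "ack"                   # only now respond to the client

    log = OperationLog([[], []])           # two replica logs as plain lists
    print(log.commit(("create", "/logs/a")))  # ack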