google file system

GFS: The Google File System Avinash Kumar BE Computer-2 Roll No-40


Post on 06-May-2015


Page 1: Google File System

GFS: The Google File System

Avinash Kumar

BE Computer-2

Roll No-40

Page 2: Google File System

Contents
· Introduction to GFS
· System Architecture
· System Features
· Working of GFS
· Latest Advancement
· Conclusion
· Questions

Page 3: Google File System

Introduction
· More than 15,000 commodity-class PCs.
· Multiple clusters distributed worldwide.
· Thousands of queries served per second.
· One query reads hundreds of MB of data.
· One query consumes tens of billions of CPU cycles.
· Google stores dozens of copies of the entire Web!

Conclusion: Google needs a large, distributed, highly fault-tolerant file system.

Page 4: Google File System

System Architecture
· A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients.

Page 5: Google File System

Large Chunk

· GFS uses a large chunk size: 64 MB (1 GB = 1024 MB = 16 chunks).
· Each chunk is stored as a plain Linux file, which is lazily extended up to 64 MB.
· Suits workloads that perform many reads and writes on a given chunk.
· Reduces network overhead by keeping a persistent connection to the chunkserver.
· See also MapReduce, Bigtable.
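The chunk arithmetic above can be sketched in a few lines of Python. This is only an illustration of the 64 MB fixed-size layout described on the slide; the function names (`chunk_index`, `chunks_needed`, `chunk_ranges`) are hypothetical and not part of any GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size stated on the slide


def chunk_index(byte_offset: int) -> int:
    """Index of the chunk that contains the given byte offset of a file."""
    return byte_offset // CHUNK_SIZE


def chunks_needed(file_size: int) -> int:
    """Number of chunks needed to hold file_size bytes (ceiling division)."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE


def chunk_ranges(offset: int, length: int):
    """Split a file-level byte range into (chunk index, start in chunk, byte count) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        index = chunk_index(offset)
        start = offset % CHUNK_SIZE
        count = min(CHUNK_SIZE - start, end - offset)
        pieces.append((index, start, count))
        offset += count
    return pieces
```

For example, `chunks_needed(1024 * 1024 * 1024)` gives the 16 chunks per gigabyte quoted above, and a read that straddles a chunk boundary is split into one piece per chunk.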

Page 6: Google File System

Architecture (cont’d)
· Chunkserver
  • Files are divided into fixed-size chunks (64 MB).
  • Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation.
  • Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle.
  • For reliability, each chunk is replicated on multiple chunkservers (default 3 replicas).
· GFS Client
  • GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application.

Page 7: Google File System

System Metadata
· The master stores three major types of metadata:
  • The file and chunk namespaces
  • The mapping from files to chunks
  • The locations of each chunk’s replicas
· All metadata is kept in the master’s memory.
· The first two types are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines.
· The master does not store the third type persistently. Instead, it asks each chunkserver about its chunks at master startup.
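A minimal sketch of how these three metadata types might be held, assuming a simple JSON-lines operation log. The class `MasterMetadata`, its methods, and the record format are hypothetical illustrations, not the actual GFS data structures; the Python list `log` stands in for the on-disk, remotely replicated log.

```python
import json


class MasterMetadata:
    """Hypothetical in-memory metadata of a GFS-style master."""

    def __init__(self):
        self.namespace = set()     # file names (persisted via the log)
        self.file_chunks = {}      # file name -> list of chunk handles (persisted via the log)
        self.chunk_locations = {}  # chunk handle -> set of chunkserver IDs (never logged)
        self.log = []              # stand-in for the operation log

    def _apply(self, record):
        if record["op"] == "create":
            self.namespace.add(record["file"])
            self.file_chunks[record["file"]] = []
        elif record["op"] == "add_chunk":
            self.file_chunks[record["file"]].append(record["handle"])

    def mutate(self, record):
        """Log the mutation first, then apply it to the in-memory state."""
        self.log.append(json.dumps(record))
        self._apply(record)

    def report_chunks(self, server_id, handles):
        """A chunkserver reports which chunks it holds."""
        for handle in handles:
            self.chunk_locations.setdefault(handle, set()).add(server_id)

    @classmethod
    def recover(cls, log):
        """Rebuild the first two metadata types by replaying the log."""
        master = cls()
        for line in log:
            master._apply(json.loads(line))
        master.log = list(log)
        return master
```

After `recover`, `chunk_locations` is empty until chunkservers call `report_chunks` again, which mirrors the point that replica locations are never persisted.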

Page 8: Google File System

System Features
PageRank: probability that a random surfer visits the page
• Citations (back links)
• How is PageRank calculated?

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where
PR -> PageRank of a page
T1...Tn -> pages that point to page A (citations)
d -> damping factor (0<d<1)
C(A) -> number of links going out of page A

The PageRank of a page depends on:
• The number of pages pointing to it.
• The PageRank of the pages that point to it.
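Because each page's rank depends on the ranks of the pages linking to it, the formula is usually evaluated by repeated substitution. The following is a small sketch under stated assumptions: the function name `pagerank`, the default `d=0.85`, the starting value of 1.0 and the fixed iteration count are illustrative choices, and each page's link list is assumed to contain no duplicates.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to this page.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr
```

A page with no incoming links ends up with exactly 1-d, the minimum the formula allows.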

Page 9: Google File System

System Features
Anchor text: the text associated with a link
• Associated with the page the link is on
• Associated with the page the link points to (unique to Google)

Advantages:
• Anchors often describe a page more accurately than the page itself
• Documents that cannot be indexed can still be returned in results

Other Features:
• Location information is kept for all hits, enabling proximity in search
• Keeps track of visual presentation details

Page 10: Google File System

System Anatomy

Page 11: Google File System

Working Of GFS

Page 12: Google File System

Google Query Evaluation

1. Parse the query.

2. Convert words into wordIDs.

3. Seek to the start of the doclist in the short barrel for every word.

4. Scan through the doclists until there is a document that matches all the search terms.

5. Compute the rank of that document for the query.

6. If in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.

7. If we are not at the end of any doclist go to step 4.

8. Sort the documents that have matched by rank and return the top k.
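The core of the steps above (scan the doclists for documents that contain every term, rank each match, return the top k) can be sketched as follows. This is a simplified illustration: it uses a single index keyed by word rather than wordIDs, omits the short/full barrel distinction, and assumes each doclist is a sorted list of integer docIDs. The names `matching_docs`, `evaluate`, `index` and `rank` are hypothetical.

```python
import heapq


def matching_docs(doclists):
    """Yield docIDs present in every sorted doclist."""
    if not doclists:
        return
    positions = [0] * len(doclists)
    while all(pos < len(dl) for pos, dl in zip(positions, doclists)):
        current = [dl[pos] for pos, dl in zip(positions, doclists)]
        target = max(current)
        if all(doc == target for doc in current):
            yield target
            positions = [pos + 1 for pos in positions]
        else:
            # Advance every list that is behind the largest current docID.
            positions = [
                pos + 1 if doc < target else pos
                for pos, doc in zip(positions, current)
            ]


def evaluate(query, index, rank, k=10):
    """Parse the query, intersect the doclists, rank matches, return the top k."""
    word_lists = [index.get(word, []) for word in query.lower().split()]
    if not word_lists:
        return []
    scored = ((rank(doc, query), doc) for doc in matching_docs(word_lists))
    return [doc for _, doc in heapq.nlargest(k, scored)]
```

Here `rank` is any caller-supplied scoring function taking a docID and the query; the scan stops as soon as any doclist is exhausted, as in steps 6 and 7.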

Page 13: Google File System

Client Read
· Client sends master: read(file name, chunk index)
· Master’s reply: chunk ID, chunk version number, locations of replicas
· Client sends “closest” chunkserver w/replica: read(chunk ID, byte range)
  • “Closest” determined by IP address on simple rack-based network topology
· Chunkserver replies with data
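The read path can be sketched with in-memory stand-ins for the master and chunkservers. Everything here is an assumption made for illustration: the classes `Master` and `Chunkserver`, the function `client_read`, and the use of a rack label instead of IP addresses to decide which replica is "closest". The sketch handles only reads that stay within a single chunk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks


class Chunkserver:
    def __init__(self, rack):
        self.rack = rack
        self.chunks = {}  # chunk ID -> bytes

    def read(self, chunk_id, start, length):
        return self.chunks[chunk_id][start:start + length]


class Master:
    def __init__(self):
        # (file name, chunk index) -> (chunk ID, version, list of chunkservers)
        self.files = {}

    def read(self, file_name, chunk_index):
        return self.files[(file_name, chunk_index)]


def client_read(master, client_rack, file_name, offset, length):
    """Read a byte range that lies within a single chunk."""
    chunk_id, _version, replicas = master.read(file_name, offset // CHUNK_SIZE)
    # Stand-in for "closest": prefer a replica on the client's own rack.
    closest = min(replicas, key=lambda cs: cs.rack != client_rack)
    return closest.read(chunk_id, offset % CHUNK_SIZE, length)
```

Note that the master only hands out metadata; the data bytes travel directly between the client and the chosen chunkserver.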

Page 14: Google File System

Client Write
· Some chunkserver is primary for each chunk
  • Master grants lease to primary (typically for 60 sec.)
  • Leases renewed using periodic heartbeat messages between master and chunkservers
· Client asks master for primary and secondary replicas for each chunk
· Client sends data to replicas in daisy chain
  • Pipelined: each replica forwards as it receives
  • Takes advantage of full-duplex Ethernet links
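The lease mechanism can be illustrated with a small table kept by the master. This is a sketch under assumptions: the class `LeaseTable` and its methods are hypothetical, time is passed in explicitly as a number of seconds, and the primary is chosen arbitrarily as the first listed replica.

```python
LEASE_SECONDS = 60.0  # typical lease duration from the slide


class LeaseTable:
    def __init__(self):
        self.leases = {}  # chunk ID -> (primary chunkserver ID, expiry time)

    def primary_for(self, chunk_id, replicas, now):
        """Return the current primary, granting a new lease if none is valid."""
        primary, expiry = self.leases.get(chunk_id, (None, 0.0))
        if primary is None or now >= expiry:
            primary = replicas[0]  # arbitrary choice in this sketch
            self.leases[chunk_id] = (primary, now + LEASE_SECONDS)
        return primary

    def heartbeat(self, server, now):
        """Extend every unexpired lease held by the server that sent the heartbeat."""
        for chunk_id, (primary, expiry) in list(self.leases.items()):
            if primary == server and now < expiry:
                self.leases[chunk_id] = (primary, now + LEASE_SECONDS)
```

As long as heartbeats keep arriving, the same chunkserver stays primary; once a lease lapses, the master is free to pick a different replica.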

Page 15: Google File System

Client Write (2)
· All replicas acknowledge data write to client
· Client sends write request to primary
· Primary assigns serial number to write request, providing ordering
· Primary forwards write request with same serial number to secondaries
· Secondaries all reply to primary after completing write
· Primary replies to client
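The separation between pushing data and committing the write can be sketched as below. This is a single-process illustration, not the real protocol: the classes `Replica` and `Primary` and the function `client_write` are hypothetical, the daisy-chain pipelining is replaced by direct pushes from the client, and failures are not modelled.

```python
class Replica:
    def __init__(self):
        self.buffer = {}        # data ID -> bytes pushed but not yet written
        self.chunk = bytearray()
        self.next_expected = 0  # serial number of the next write to apply

    def push(self, data_id, data):
        """Data flow: buffer the data and acknowledge receipt."""
        self.buffer[data_id] = data
        return True

    def apply(self, serial, data_id):
        """Control flow: apply a buffered write in serial-number order."""
        if serial != self.next_expected:
            raise ValueError("write arrived out of order")
        self.chunk += self.buffer.pop(data_id)
        self.next_expected += 1
        return serial


class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        """Assign a serial number, apply locally, then forward to secondaries."""
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data_id)
        acks = [s.apply(serial, data_id) for s in self.secondaries]
        return all(ack == serial for ack in acks)


def client_write(primary, data_id, data):
    replicas = [primary, *primary.secondaries]
    # Step 1: every replica must acknowledge the pushed data.
    if not all(r.push(data_id, data) for r in replicas):
        return False
    # Step 2: the write request goes to the primary, which orders it.
    return primary.write(data_id)
```

Because every replica applies writes in the primary's serial order, all replicas end up with identical chunk contents even when several clients write concurrently.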

Page 16: Google File System

Client Write (3)

Page 17: Google File System

What Happens If the Master Reboots?

· Replays log from disk
  • Recovers namespace (directory) information
  • Recovers file-to-chunk-ID mapping
· Asks chunkservers which chunks they hold
  • Recovers chunk-ID-to-chunkserver mapping
· If a chunkserver has an older chunk version, that replica is stale
  • The chunkserver was down at lease renewal
· If a chunkserver has a newer chunk version, the master adopts its version number
  • The master may have failed while granting the lease
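The version-number comparison performed after a restart can be sketched as a single reconciliation function. The function `reconcile` and its argument shapes are hypothetical; chunks reported by a chunkserver but unknown to the master are simply ignored in this sketch.

```python
def reconcile(master_versions, reports):
    """Compare logged chunk versions with what chunkservers report.

    master_versions: chunk ID -> version recovered from the master's log.
    reports: chunkserver ID -> {chunk ID: version held on that server}.
    Returns (versions, locations, stale).
    """
    versions = dict(master_versions)
    # First pass: adopt any newer version number seen on a chunkserver.
    for held in reports.values():
        for chunk_id, version in held.items():
            if chunk_id in versions and version > versions[chunk_id]:
                versions[chunk_id] = version
    locations = {chunk_id: set() for chunk_id in versions}
    stale = set()
    # Second pass: replicas older than the adopted version are stale.
    for server, held in reports.items():
        for chunk_id, version in held.items():
            if chunk_id not in versions:
                continue  # unknown to the master; ignored in this sketch
            if version < versions[chunk_id]:
                stale.add((server, chunk_id))
            else:
                locations[chunk_id].add(server)
    return versions, locations, stale
```

Only up-to-date replicas enter the chunk-ID-to-chunkserver mapping; stale ones are set aside rather than served to clients.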

Page 18: Google File System

What Happens If a Chunkserver Fails?

· Master notices missing heartbeats
· Master decrements count of replicas for all chunks on dead chunkserver
· Master re-replicates chunks missing replicas in background
  • Highest priority for chunks missing greatest number of replicas
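The prioritisation rule can be sketched as a sort over under-replicated chunks. The function `handle_chunkserver_failure` and the `chunk_locations` mapping are hypothetical; the replication goal of 3 is the default replica count given earlier in the slides.

```python
REPLICATION_GOAL = 3  # default replica count from the slides


def handle_chunkserver_failure(chunk_locations, dead_server):
    """Drop the dead server and return chunk IDs to re-replicate, most urgent first."""
    for servers in chunk_locations.values():
        servers.discard(dead_server)
    under = [
        chunk_id for chunk_id, servers in chunk_locations.items()
        if len(servers) < REPLICATION_GOAL
    ]
    # Chunks missing the most replicas (fewest live copies) come first.
    return sorted(under, key=lambda chunk_id: len(chunk_locations[chunk_id]))
```

A chunk left with a single live replica is therefore copied before one that still has two.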

Page 19: Google File System

Latest Advancement
· Gmail: an easily configurable email service with 1 GB of web space.
· Blogger: a free web-based service that helps consumers publish on the web without writing code or installing software.
· Google “next generation corporate s/w”: a smaller version of the Google software, modified for private use.

Page 20: Google File System

Conclusion
· Success: used actively by Google to support search service and other applications
  • Availability and recoverability on cheap hardware
  • High throughput by decoupling control and data
  • Supports massive data sets and concurrent appends
· Semantics not transparent to apps
  • Must verify file contents to avoid inconsistent regions, repeated appends (at-least-once semantics)
· Performance not good for all apps
  • Assumes read-once, write-once workload (no client caching!)

Page 21: Google File System

Thank you

Any questions?